© Springer Nature Switzerland AG 2023
Y. Xia, J. Sun Bioinformatic and Statistical Analysis of Microbiome Data https://doi.org/10.1007/978-3-031-21391-5_14

14. Compositional Analysis of Microbiome Data

Yinglin Xia1   and Jun Sun 1
(1)
Department of Medicine, University of Illinois Chicago, Chicago, IL, USA
 

Abstract

This chapter focuses on compositional data analysis (CoDA). First, it describes compositional data in general, the rationale for treating microbiome data as compositional, the Aitchison simplex, the challenges of analyzing compositional data, the fundamental principles of CoDA, and the family of log-ratio transformations. Then, it introduces three CoDA methods/models: ANOVA-like compositional differential abundance analysis (ALDEx2), analysis of composition of microbiomes (ANCOM), and analysis of composition of microbiomes with bias correction (ANCOM-BC). Next, it provides some remarks on the CoDA approach.

Keywords
Compositional data, Aitchison simplex, Log-ratio transformations, Additive log-ratio (alr), Centered log-ratio (clr), Isometric log-ratio (ilr), Inter-quartile log-ratio (iqlr), ALDEx2, Analysis of composition of microbiomes (ANCOM), Analysis of composition of microbiomes-bias correction (ANCOM-BC), ANCOMBC package, Scaling invariance, Subcompositional coherence, Permutation invariance, Perturbation invariance, Subcompositional dominance, Bland-Altman plot, Difference plot, Tukey mean-difference plot, Effect size, xtable(), xtable package, edgeR, DESeq, q2-composition, qiime composition ancom, False discovery rate (FDR), Holm-Bonferroni, Bonferroni correction method, Benjamini-Hochberg (BH) procedure

This chapter focuses on compositional data analysis (CoDA). It is organized as follows: First, we describe compositional data in general, the reasons that microbiome data can be treated as compositional, the Aitchison simplex, the challenges of analyzing compositional data, some fundamental principles of CoDA, and the family of log-ratio transformations (Sect. 14.1). Then, we introduce three CoDA methods or models: ANOVA-like compositional differential abundance analysis (ALDEx2) (Sect. 14.2), analysis of composition of microbiomes (ANCOM) (Sect. 14.3), and analysis of composition of microbiomes with bias correction (ANCOM-BC) (Sect. 14.4). Next, we make some remarks on the CoDA approach (Sect. 14.5). Finally, we complete this chapter with a brief summary of CoDA in Sect. 14.6.

14.1 Introduction to Compositional Data

14.1.1 What Are Compositional Data?

Composition is “the act of putting together parts or elements to form a whole” and is “the way in which such parts are combined or related: constitution” (Webster’s II New College Dictionary, 2005, p. 236) [also see (Xia et al. 2018a)]. As described in Aitchison’s 1986 seminal work (Aitchison 1986b), a compositional dataset has the following four characteristics: (1) each row represents an observational unit; (2) each column represents a component of the whole; (3) each entry is non-negative; and (4) the sum of all the entries in each row equals 1 or 100%. That is, compositional data quantitatively measure each element as a component or part of a whole: a vector of positive values (i.e., components or parts) carrying only relative information between those components or parts (Pawlowsky-Glahn et al. 2015; Hron et al. 2010; Egozcue and Pawlowsky-Glahn 2011; Aitchison 1986a). Thus, compositional data exist as the proportions/fractions of a whole or the portions of a total (van den Boogaart and Tolosana-Delgado 2013a) and have the following unique properties: (1) The elements of the composition are non-negative and sum to unity (Bacon-Shone 2011; Xia et al. 2018a). (2) The total sum of all component values (i.e., the library size) is an artifact of the sampling procedure (Quinn et al. 2018a, b; van den Boogaart and Tolosana-Delgado 2008). (3) The difference between component values is only meaningful proportionally (Quinn et al. 2018a, b; van den Boogaart and Tolosana-Delgado 2008). These unique properties, especially the constraint that the percentages of all components necessarily add up to 100, introduce effects on correlations (Krumbein 1962).

14.1.2 Microbiome Data Are Treated as Compositional

Compositional data analysis is concerned only with relative frequencies rather than absolute amounts. Compositional data frequently arise in various research fields, including genomics, population genetics, demography, ecology, biology, chemistry, geology, petrology, sedimentology, geochemistry, planetology, psychology, marketing, survey analysis, economics, probability, and statistics.

Microbiome data are, strictly speaking, count data. However, some researchers consider or treat microbiome data as compositional, mainly for the following three reasons (see the details in (Xia et al. 2018a)):
  1. (1)

    The high-throughput sequencing technology itself constrains the sequencing data, including microbiome data, to some constant (each sample's read counts are constrained by an arbitrary total sum, i.e., the library size) when sequencing platforms are used to generate the data (Quinn et al. 2018a, b), rendering the total values of the data meaningless. Regardless of whether the datasets are generated via 16S rRNA gene fragment sequencing or shotgun metagenomic sequencing, the observed number of reads (sequencing depth) is determined by the capacity of the sequencing platform used and the number of samples that are multiplexed in the run (Fernandes et al. 2014). Thus, the total reads mapped from high-throughput sequencing methods are finite, although they are large.

     
  2. (2)

    Sample preparation and the DNA/RNA extraction process cause each entry of a microbiome dataset to carry only relative information in the measurements of features (OTU or taxa abundances) (Lovell et al. 2011). RNA sequencing begins with extraction of DNA/RNA from a tissue sample whose weight or volume is fixed. Thus, the number of sequence fragment reads obtained from a fixed volume of total RNA is finite.

     
  3. (3)

    In practice, to reduce experimental biases due to sampling depth and the biases of sample preparation and DNA/RNA extraction, the abundance read counts are typically divided by the total sum of counts in the sample (the total library size) to normalize the data before analysis. All these result in microbiome datasets having a compositional data structure, in which the abundance of each component (e.g., taxon/OTU) carries only relative information and hence is only coherently interpretable relative to the other components within that sample (Quinn et al. 2018a, b).

     

14.1.3 Aitchison Simplex

Mathematically, a dataset is defined as compositional if it contains D multiple parts of nonnegative numbers whose sum is 1 (Aitchison 1986a, p. 25) or any other constant (a constant-sum constraint) (Pawlowsky-Glahn et al. 2015, p. 10). It can be formally stated as:

$$ {S}^D=\left\{X=\left[{x}_1,{x}_2,\dots, {x}_D\right]\left|{x}_i>0,i=1,2,\dots, D;\sum \limits_{i=1}^D{x}_i=\kappa \right.\right\}. $$
(14.1)

That is, compositional data can be represented by constant-sum real vectors with positive components, where 𝜅 is an arbitrary constant. Depending on the units of measurement or rescaling, frequent values are 1 (per unit, proportions), 100 (percent, %), 10^6 (ppm, parts per million), and 10^9 (ppb, parts per billion). Equation 14.1 defines the sample space of compositional data as a hyperplane, called the simplex (Aitchison 1986a, p. 27); also see (Mateu-Figueras et al. 2011; van den Boogaart and Tolosana-Delgado 2013a, p. 37) and (Pawlowsky-Glahn et al. 2015, p. 10). Compositional data do not exist in real Euclidean space, but rather in the simplex (a sub-space) (Aitchison 1986a).
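The constant-sum constraint in Eq. 14.1 is achieved by the closure operation, which rescales any vector of positive components onto the simplex. A minimal Python sketch (the function name `closure` follows common CoDA usage; this is an illustration, not code from the chapter's R packages):

```python
import math

def closure(x, kappa=1.0):
    """Rescale a vector of positive components so they sum to the
    constant kappa, projecting it onto the simplex S^D (Eq. 14.1)."""
    total = sum(x)
    return [kappa * xi / total for xi in x]

counts = [110, 20, 50]               # raw counts; only the ratios matter
proportions = closure(counts)        # kappa = 1: proportions
percentages = closure(counts, 100)   # kappa = 100: percentages

print(proportions)
print(math.isclose(sum(percentages), 100.0))  # constant-sum constraint holds
```

Note that the choice of 𝜅 only rescales the vector; the ratios between components, which carry all the information, are unchanged.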

14.1.4 Challenges of Analyzing Compositional Data

Standard methods (e.g., correlation analysis) rely on the assumption of Euclidean geometry in real space (Eaton 1983) and assume that the differences between the tested variables are linear or additive. However, the simplex has one dimension less than real space (a D-part composition has D − 1 dimensions), which reflects the special properties of the sample space (Aitchison 1986a). Therefore, in compositional data each component (or part) of the whole is dependent on the other components. Because the proportions of all components sum to 1 (100%), at least two components are negatively correlated. This results in the dependency problem of the measured variables (e.g., taxa/OTUs/ASVs). In the statistical and especially the current microbiome literature, the issue of dealing with proportions is referred to as compositionality. Microbial abundances are typically interpreted as proportions (e.g., relative abundance) to account for differences in sequencing depth (see Sect. 14.1.2 for details). Because proportions add to one, a change in a single microbial taxon will also change the proportions of the remaining microbial taxa. Therefore, it is a challenge to infer exactly which microbial taxon is changing with treatments or conditions.
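This dependency is easy to demonstrate numerically. In the following Python sketch (with hypothetical taxon counts), only the first taxon truly changes, yet every observed proportion changes:

```python
def proportions(counts):
    """Convert absolute counts to proportions of the total."""
    total = sum(counts)
    return [c / total for c in counts]

before = [50, 30, 20]    # hypothetical absolute taxon abundances
after = [100, 30, 20]    # only the first taxon actually doubles

p_before = proportions(before)
p_after = proportions(after)

# Taxa 2 and 3 are unchanged in absolute terms, yet their observed
# proportions drop because the total (the denominator) grew.
print(p_before)  # [0.5, 0.3, 0.2]
print(p_after)
```

From the proportions alone, one cannot tell whether taxon 1 increased or taxa 2 and 3 decreased, which is exactly the inferential challenge described above.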

Compositionality makes most standard statistical methods and tests invalid. The challenges of analyzing compositional data have been described in our previous books (Xia 2020; Xia and Sun 2022a, b). Here, we summarize the main points below.
  • It has been shown that two-sample t-tests and Wilcoxon rank sum tests have a higher false discovery rate (FDR) and low power to detect differential abundance because they ignore the compositionality or dependency effect (microbial taxa in the same niche grow dependently) (Hawinkel et al. 2017).

  • Analyzing compositional (or relative) data using Pearson and Spearman correlations will lead to the problem of “spurious correlation” between unrelated variables (Pearson 1897). Compositional data are not linear or monotonic; instead they exhibit dependence between components. Pearson and Spearman correlation analysis methods were originally proposed for absolute values. The dependence of each pair of components of compositional data violates the assumption of linear (i.e., Pearson) or rank (i.e., Spearman) correlation that the paired data are randomly selected (independent).

  • Visualizing or presenting compositional data using standard graphical tools (e.g., scatter plot, QQ plot) will result in a distorted graph.

  • Analyzing compositional data using multivariate parametric models (e.g., MANOVA) violates the assumptions of multivariate parametric analysis. Standard multivariate parametric models, such as MANOVA and multivariate linear regression, assume that the response variables are multivariate normally distributed and have a linear relationship with the predictors. However, compositional data are not multivariate normally distributed. Thus, using MANOVA (or ANOVA) and multivariate (or univariate) linear regression to test hypotheses on the response variables is meaningless due to the dependence of the compositions.

14.1.5 Fundamental Principles of Compositional Data Analysis

Aitchison (1982, 1986a) proposed three fundamental principles to adhere to when analyzing compositional data. These fundamental principles are all rooted in the definition of compositional data (only ratios of components carry information) and have been reformulated several times according to new theoretical developments (Barceló-Vidal et al. 2001; Martín-Fernández et al. 2003; Aitchison and Egozcue 2005; Egozcue 2009; Egozcue and Pawlowsky-Glahn 2011). The three fundamental principles are: (1) scaling invariance; (2) subcompositional coherence; and (3) permutation invariance.

In 2008, Aitchison summarized the basic principle of compositional data analysis. For its formal expression, we cannot do better than to quote his definition (Aitchison 2008):

Any meaningful function of a composition can be expressed in terms of ratios of the components of the composition. Perhaps equally important is that any function of a composition not expressible in terms of ratios of the components is meaningless.

  • Scaling invariance states that vectors with proportional positive components must be treated (analyzed) as representing the same composition (Lovell et al. 2015). That is, statistical inferences about compositional data should not depend on the scale used; i.e., multiplying a composition by a constant k will not change the results (van den Boogaart and Tolosana-Delgado 2013b). Thus, the vector of per-units and the vector of percentages convey exactly the same information (Egozcue and Pawlowsky-Glahn 2011), and we should obtain exactly the same results from analyzing proportions and percentages. For example, the vectors a = [11, 2, 5], b = [110, 20, 50], and c = [1100, 200, 500] all represent the same composition because the relative importance (the ratios) between their components is the same (van den Boogaart and Tolosana-Delgado 2013b).

  • Subcompositional coherence states that analyses should depend only on data about components (or parts) within that subset, not depend on other non-involved components (or parts) (Egozcue and Pawlowsky-Glahn 2011); and statistical inferences about subcompositions (a particular subset of components) should be consistent, regardless of whether the inference is based on the subcomposition or the full composition (Lovell et al. 2015).

  • Permutation invariance states that the conclusions of a compositional analysis should not depend on the order (the sequence) of the components (the parts) (Egozcue and Pawlowsky-Glahn 2011; van den Boogaart and Tolosana-Delgado 2013b; Lovell et al. 2015). That is, in compositional analysis, the information from the order of the different components plays no role (i.e., changing the order of the components within a composition will not change the results) (van den Boogaart and Tolosana-Delgado 2013b; Quinn et al. 2018a, b). For example, it does not matter that we choose which component to be the “first,” which component to be the “second,” and which one to be the “last.”

Other important properties of CoDa include perturbation invariance and subcompositional dominance. Perturbation invariance states that converting a composition between equivalent units will not change the results (van den Boogaart and Tolosana-Delgado 2013b; Quinn et al. 2018a, b).

Subcompositional dominance states that using a subset of a complete composition carries less information than using the whole (van den Boogaart and Tolosana-Delgado 2013b; Quinn et al. 2018a, b).
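The first two principles can be checked numerically. A small Python sketch using the vectors a, b, and c from the scaling-invariance example above (the helper `pairwise_ratios` is ours, introduced only for this illustration):

```python
from itertools import combinations

def pairwise_ratios(x):
    """All ratios x_i / x_j (i < j): the ratio-based 'fingerprint'
    of a composition."""
    return [x[i] / x[j] for i, j in combinations(range(len(x)), 2)]

a = [11, 2, 5]
b = [110, 20, 50]
c = [1100, 200, 500]

# Scaling invariance: all three vectors carry identical ratios and
# therefore represent the same composition.
print(pairwise_ratios(a) == pairwise_ratios(b) == pairwise_ratios(c))  # True

# Subcompositional coherence: the ratio a1/a2 is unchanged when the
# subcomposition [a1, a2] is closed (renormalized) on its own.
sub = [a[0] / (a[0] + a[1]), a[1] / (a[0] + a[1])]
print(abs(sub[0] / sub[1] - a[0] / a[1]) < 1e-12)  # True
```

Permutation invariance holds trivially here as well: shuffling the components merely reorders the set of pairwise ratios without changing their values.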

14.1.6 The Family of Log-Ratio Transformations

Compositional data exist in the Aitchison simplex (Aitchison 1986a), while most statistical methods or models are valid in Euclidean (real) space. Aitchison (1981, 1982, 1983, 1984, 1986a; Aitchison and Egozcue 2005) proved that compositional data can be mapped into Euclidean space by using log-ratio transformations and then analyzed using standard statistical methods or models, such as those based on a Euclidean distance metric (Aitchison et al. 2000). Thus, most compositional data analyses usually start with a log-ratio transformation (Xia et al. 2018a).

The principle behind the log-ratio transformation is that there exists a one-to-one correspondence between compositional vectors and their associated log-ratio vectors, so that any statement about compositions can be reformulated in terms of log-ratios and vice versa (Pawlowsky-Glahn et al. 2015). Thus, log-ratio transformations solve the problem of a constrained sample space by projecting compositional data in the simplex into multivariate real space, thereby opening up all available standard multivariate techniques (Pawlowsky-Glahn et al. 2015).

In order to transform the simplex to real space, Aitchison in his seminal work (1986a, b) developed a set of fundamental principles and a variety of methods, operations, and tools, including the additive log-ratio (alr) transformation, for compositional data analysis.

The log-ratio transformation methodology has been accepted in various fields such as geology and ecology (Pawlowsky-Glahn and Buccianti 2011; van den Boogaart and Tolosana-Delgado 2013a; Aitchison 1982; Pawlowsky-Glahn et al. 2015). Here, we introduce the family of log-ratio transformations including the additive log-ratio (alr) (Aitchison 1986a, p. 135), the centered log-ratio (clr) (Aitchison 2003), the inter-quartile log-ratio (iqlr) (Fernandes et al. 2013, 2014), and isometric log-ratio (ilr) (Egozcue et al. 2003) transformations.

14.1.6.1 Additive Log-Ratio (alr) Transformation

The original approach of compositional data analysis proposed in Aitchison (1986a, b) was based on the alr-transformation. It is defined as:

$$ \mathrm{alr}(x)=\left[\ln \left(\frac{x_1}{x_D}\right),\dots, \ln \left(\frac{x_i}{x_D}\right),\dots, \ln \left(\frac{x_{D-1}}{x_D}\right)\right]. $$
(14.2)

This formula maps a composition in the D-part Aitchison simplex non-isometrically to a (D − 1)-dimensional Euclidean vector. The alr-transformation chooses one component as a reference and takes the logarithm of each measurement within a composition vector (i.e., in the microbiome case, each sample vector containing relative abundances) after dividing by the reference component (usually the component with index D, where D is the total number of taxa). The components xj sum to unity, but we can replace them with any observed counts without changing the expression, because the library sizes cancel in the ratios. After the alr-transformation, any separation between the groups revealed by the ratios can be analyzed by standard statistical tools (Thomas and Aitchison 2006).
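A minimal Python sketch of Eq. 14.2, using the last component as the reference, also illustrates the cancellation of library size (hypothetical counts; not the chapter's R code):

```python
import math

def alr(x):
    """Additive log-ratio (Eq. 14.2): log of each of the first D-1
    components relative to the reference component x_D (the last)."""
    return [math.log(xi / x[-1]) for xi in x[:-1]]

counts = [110.0, 20.0, 50.0]
total = sum(counts)
proportions = [c / total for c in counts]

# The library size cancels in the ratios, so counts and proportions
# yield the same alr coordinates.
print(alr(counts))
print(all(abs(u - v) < 1e-12
          for u, v in zip(alr(counts), alr(proportions))))  # True
```

Note that a 3-part composition yields only D − 1 = 2 alr coordinates, and a different choice of reference component would give different coordinates (the asymmetry discussed in Sect. 14.1.7).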

14.1.6.2 Centered Log-Ratio (clr) Transformation

The clr-transformation is defined as the logarithm of the components after dividing by the geometric mean of x:

$$ clr(x)=\left[\ln \left(\frac{x_1}{g_m(x)}\right),\dots, \ln \left(\frac{x_i}{g_m(x)}\right),\dots, \ln \left(\frac{x_D}{g_m(x)}\right)\right], $$
(14.3)

with x = (x1, …, xi, …, xD) representing the composition and $$ {g}_m(x)=\sqrt[D]{x_1\cdot {x}_2\cdots {x}_D} $$ its geometric mean, which ensures that the sum of the elements of clr(x) is zero. The clr-transformation maps a composition in the D-part Aitchison simplex isometrically onto a (D − 1)-dimensional hyperplane of Euclidean space (the clr components sum to zero). Unlike the alr-transformation, in which a specific taxon is used as the reference, the clr-transformation uses the geometric mean of the composition (i.e., the sample vector) in place of xD. Like the alr, after performing the clr-transformation, standard unconstrained statistical methods can be used for analyzing compositional data (Aitchison 2003).

The clr-transformation algorithm has been adopted by some software (van den Boogaart and Tolosana-Delgado 2013b; Fernandes et al. 2013), and it has been shown that it can be used to analyze microbiome data, RNA-seq data, and other next-generation sequencing data (Fernandes et al. 2014).
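Equation 14.3 can be sketched in a few lines of Python (illustrative only; the packages discussed in this chapter implement it in R):

```python
import math

def clr(x):
    """Centered log-ratio (Eq. 14.3): log of each component relative
    to the geometric mean g_m(x) of the composition."""
    log_gm = sum(math.log(xi) for xi in x) / len(x)
    return [math.log(xi) - log_gm for xi in x]

y = clr([110.0, 20.0, 50.0])
print(abs(sum(y)) < 1e-12)        # clr components sum to zero
# Scaling invariance: multiplying the composition by a constant
# leaves the clr coordinates unchanged.
z = clr([1.10, 0.20, 0.50])
print(all(abs(u - v) < 1e-12 for u, v in zip(y, z)))  # True
```

The zero-sum constraint shown here is also the source of the singular covariance matrix discussed in Sect. 14.1.7.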

14.1.6.3 Isometric Log-Ratio (ilr) Transformation

The ilr-transformation is defined as:

$$ y=\mathrm{ilr}(x)=\left({y}_1,\dots, {y}_{D-1}\right)\in {R}^{D-1}, $$
(14.4)
where $$ {y}_i=\frac{1}{\sqrt{i\left(i+1\right)}}\ln \left[\frac{\Pi_{j=1}^i{x}_j}{{\left({x}_{i+1}\right)}^i}\right],i=1,\dots, D-1. $$

Like the clr, the ilr-transformation maps a composition in the D-part Aitchison simplex isometrically to a (D − 1)-dimensional Euclidean vector. The ilr-transformation is an orthonormal isometry: it is the product of the clr-transformed composition and the transpose of a matrix whose elements are the clr-transformed components of an orthonormal basis. The ilr-transformation expresses the data in an orthonormal coordinate system constructed from sequential binary partitions of taxa (van den Boogaart and Tolosana-Delgado 2013b).

Like alr and clr, ilr-transformation was developed to transform compositional data from the simplex into real space where standard statistical tools can be applied (Egozcue and Pawlowsky-Glahn 2005; Egozcue et al. 2003). That is, the ilr-transformed data can be analyzed using the standard statistical methods.
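One common orthonormal basis consistent with Eq. 14.4 (the sequential, or pivot, basis) can be sketched in Python; other bases are possible, so this is illustrative rather than canonical:

```python
import math

def ilr(x):
    """Isometric log-ratio (Eq. 14.4) in the pivot basis:
    y_i = ln( prod_{j<=i} x_j / x_{i+1}^i ) / sqrt(i*(i+1))."""
    log_x = [math.log(xi) for xi in x]
    y = []
    for i in range(1, len(x)):
        y.append((sum(log_x[:i]) - i * log_x[i]) / math.sqrt(i * (i + 1)))
    return y

x = [0.5, 0.3, 0.2]
print(len(ilr(x)))  # a 3-part composition maps to D - 1 = 2 coordinates
# Scaling invariance: the un-closed vector [5, 3, 2] gives the same ilr.
print(all(abs(u - v) < 1e-12
          for u, v in zip(ilr(x), ilr([5.0, 3.0, 2.0]))))  # True
```

Unlike clr, the resulting D − 1 coordinates are unconstrained, so their covariance matrix is non-singular and standard multivariate methods apply directly.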

14.1.6.4 Inter-quartile Log-Ratio (iqlr) Transformation

The inter-quartile log-ratio (iqlr) transformation introduced in the ALDEx2 package (Fernandes et al. 2013, 2014) includes in the geometric mean calculation only those taxa whose variance falls within the inter-quartile range of the total variance. That is, the iqlr-transformation uses the geometric mean of a subset of taxa as the reference.
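The idea can be sketched roughly in Python (this is our simplified reading of the approach, not the ALDEx2 implementation, which is in R; the helper names are hypothetical and the input is assumed to be a zero-free table):

```python
import math
import statistics

def clr_row(x):
    """Centered log-ratio of one sample vector."""
    log_gm = sum(math.log(v) for v in x) / len(x)
    return [math.log(v) - log_gm for v in x]

def iqlr(samples):
    """Sketch of the iqlr idea: re-reference each sample to the
    geometric mean of only those features whose clr variance across
    samples lies within the inter-quartile range."""
    clr_vals = [clr_row(x) for x in samples]
    n_feat = len(samples[0])
    variances = [statistics.pvariance([row[i] for row in clr_vals])
                 for i in range(n_feat)]
    q1, _, q3 = statistics.quantiles(variances, n=4)
    keep = [i for i, v in enumerate(variances) if q1 <= v <= q3]
    out = []
    for x in samples:
        log_ref = sum(math.log(x[i]) for i in keep) / len(keep)
        out.append([math.log(v) - log_ref for v in x])
    return out

# Hypothetical zero-free count table: 3 samples x 4 taxa.
table = [[10.0, 20.0, 30.0, 40.0],
         [12.0, 18.0, 33.0, 37.0],
         [8.0, 25.0, 28.0, 39.0]]
transformed = iqlr(table)
print(len(transformed), len(transformed[0]))  # 3 4
```

Using only the low-to-moderate-variance features as the reference is intended to make the denominator more stable when some taxa differ strongly between conditions.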

14.1.7 Remarks on Log-Ratio Transformations

The difference among the three log-ratio (alr, clr, and ilr) transformations lies in the choice of divisor. Each transformation has its own strengths and weaknesses.
  1. (1)

    alr-transformation

     
  • The strengths of alr-transformation are:
    • It is the simplest transformation in the log-ratio transformation family, and hence its results are relatively easy to interpret.

    • The relation to the original first D − 1 parts is preserved, and it is still in wide use.

  • The weaknesses of alr-transformation are:
    • It is not an isometric transformation from the Aitchison simplex metric onto real alr-space with the ordinary Euclidean metric. This weakness could be addressed using an appropriate metric with oblique coordinates in alr-space, but that is not standard practice (Aitchison and Egozcue 2005).

    • Theoretically, we cannot use standard statistical methods such as ANOVA and the t-test to analyze alr-transformed data. Although this weakness is a conceptual rather than practical problem (Aitchison et al. 2000), and this transformation was used in Aitchison (1986a, b) and further developed in Aitchison et al. (2000), by definition the alr-transformation is asymmetric in the parts of the composition (Egozcue et al. 2003); thus, the distances between points in the transformed space are not the same for different divisors (Bacon-Shone 2011).

    • In practice, it is not always easy to choose an obvious reference, the choice of reference taxon is somewhat arbitrary (Li 2015), and results may vary substantially when different references are chosen (Tsilimigras and Fodor 2016). This may explain why the alr-transformation is an optional rather than the default transformation approach in Analyzing Compositional Data with R (van den Boogaart and Tolosana-Delgado 2013a, b).

  2. (2)

    clr-transformation

     
  • The strengths of clr-transformation are:
    • It avoids the alr-transformation problem of choosing a divisor (e.g., using one reference taxon) because the clr-transformation uses the geometric mean as the divisor.

    • It is an isometric transformation of the simplex with the Aitchison metric, onto a subspace of real space with the ordinary Euclidean metric (Egozcue et al. 2003).

    • Thus, it is the most often used transformation in the literature of compositional data analysis (CoDA) (Xia et al. 2018a).

  • The weaknesses of clr-transformation are:
    • The clr covariance matrix is singular; thus, some standard statistical procedures are difficult to use without adaptation (Bacon-Shone 2011).

    • Its most prominent weakness is that it is not straightforward to obtain the orthogonal references in its subspace (Egozcue et al. 2003).

  3. (3)

    ilr-transformation

     
  • The strengths of ilr-transformation are:
    • It has significant conceptual advantages (Bacon-Shone 2011).

    • It avoids the arbitrariness of alr and the singularity of clr; thus, it addresses certain difficulties of alr and clr, and the ilr-transformed data can be analyzed using all standard statistical methods.

  • However, the ilr-transformation has weaknesses and has been criticized because of the following difficulties:
    • It is very difficult to naturally model practical compositional situations in terms of a sequence of orthogonal logcontrasts, which is in contrast to the practical use of orthogonality in the simplicial singular value decomposition (Aitchison 2008). In other words, the ilr-transformation approach conflicts with the practical use of principal component analysis or principal logcontrast analysis in CoDA (Aitchison 1983, 1986a).

    • Ensuring isometry has little to do with the compositional problem itself; in fact, the coordinates in any ilr-transformation necessarily require a set of orthogonal logcontrasts, which in clinical practice lacks interpretability (Aitchison 2008), or whose interpretability depends on the selection of the basis (Egozcue et al. 2003).

    • There is no one-to-one relation between the original components and the transformed variables.

Due to these limitations, and specifically the difficulty of interpreting the results, the adoption or application of ilr in practice has been somewhat limited (Egozcue et al. 2003; Xia et al. 2018a).

However, compared to the alr-transformation, which is not isometric, and the clr-transformation, which is not an isomorphism, the isometric log-ratio (ilr) transformation is both an isomorphism and an isometry, and its coordinates are also known as balances. Thus, orthonormal coordinates can be defined using this transformation. Recently the ilr-transformation and the concept of groups of parts and their balances in CoDA (Egozcue and Pawlowsky-Glahn 2005) have found potentially interesting applications in microbiome studies, where a rooted phylogeny is used to build balances of taxa in inter-group analysis (Washburne et al. 2017; Rivera-Pinto et al. 2018). With the ilr-transformation, a sequential binary partition is used to construct a new set of coordinates; in microbiome applications, the sequential binary partition of the phylogeny is applied to the microbiome dataset until it produces D − 1 coordinates (called “balances”) (Rivera-Pinto et al. 2018). The statistical analysis is then performed on the balances (Rivera-Pinto et al. 2018) rather than on the relative abundances of the D different OTUs/taxa.
  4. (4)

    iqlr-transformation

     

Compared to the other three log-ratio transformations, the iqlr-transformation has not been widely applied.
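The balances described in the ilr remarks above can be sketched in Python under the usual definition; the taxon grouping here is hypothetical, standing in for one split of a sequential binary partition of a phylogeny:

```python
import math

def gmean(values):
    """Geometric mean of a list of positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def balance(x, num_idx, den_idx):
    """One balance coordinate of a sequential binary partition:
    b = sqrt(r*s/(r+s)) * ln( g(numerator parts) / g(denominator parts) ),
    where r and s are the sizes of the two groups."""
    r, s = len(num_idx), len(den_idx)
    g_num = gmean([x[i] for i in num_idx])
    g_den = gmean([x[i] for i in den_idx])
    return math.sqrt(r * s / (r + s)) * math.log(g_num / g_den)

# Hypothetical 4-taxon composition; the first split of the partition
# contrasts taxa {0, 1} against taxa {2, 3}.
x = [0.4, 0.3, 0.2, 0.1]
print(balance(x, [0, 1], [2, 3]))
```

Repeating this for each split of the partition yields the D − 1 balance coordinates on which the inter-group statistical analysis is performed.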

14.2 ANOVA-Like Compositional Differential Abundance Analysis

Previously, most existing tools for compositional data analysis were used in other fields, such as geology and ecology. ANOVA-Like Differential Expression (ALDEx) analysis is one of the early statistical methods developed under a framework of compositional analysis, originally for mixed-population RNA-seq experiments.

14.2.1 Introduction to ALDEx2

ALDEx (Fernandes et al. 2013) and ALDEx2 (Fernandes et al. 2014) were initially developed for analyzing differential expression of mixed-population RNA sequencing (RNA-seq) data, but it has been shown that this approach generalizes to nearly any type of high-throughput sequencing data, including three completely different experimental designs: traditional RNA-seq, 16S rRNA gene amplicon sequencing, and selective growth-type (SELEX) experiments (Fernandes et al. 2014; Gloor and Reid 2016; Gloor et al. 2016; Urbaniak et al. 2016). The R packages ALDEx and ALDEx2 have been used to analyze high-throughput sequencing datasets in a unified way, including RNA-seq, chromatin immunoprecipitation sequencing (ChIP-seq), 16S rRNA gene sequencing fragments, metagenomic sequencing, and selective growth experiments (Fernandes et al. 2014).

High-throughput sequencing (e.g., microbiome) data have several sources of variance including sampling replication, technical replication, variability within biological conditions, and variability between biological conditions. ALDEx and ALDEx2 were developed in a traditional ANOVA-like framework that decompose sample-to-sample variation into four parts: (1) within-condition variation, (2) between-condition variation, (3) sampling variation, and (4) general (unexplained) error.

Fernandes et al. (2013) highlighted the importance of partitioning and comparing biological between-condition and within-condition differences (variation), and hence they developed ALDEx, an ANOVA-like differential expression procedure, to identify genes (in our case, taxa/OTUs) with greater between- than within-condition differences via the parametric Welch's t-test or non-parametric tests such as the Wilcoxon rank sum test or Kruskal-Wallis test (Fernandes et al. 2013, 2014).

ALDEx was developed using the equivalence between Poisson and multinomial processes to infer proportions from counts and to model the data as “compositional” or “proportional.” In other words, ALDEx2 uses a log-ratio transformation rather than the effective library size normalization used in count-based differential expression studies. However, when sample sizes are small, the marginal proportions have large variance and are far from normally distributed. Thus, ALDEx performs all inferences based on the compositionally valid Dirichlet distribution, i.e., using the full posterior distribution of probabilities drawn from the Dirichlet distribution through Bayesian techniques.

We can summarize the ALDEx2 methods into five fundamental parts (procedures) as below. For the details, the readers can refer to the original publications (Fernandes et al. 2013, 2014).
  • Part 1: Convert the observed abundances into relative abundances by Monte Carlo (MC) sampling.

Like other CoDA approaches, ALDEx is not interested in the total number of reads; instead it infers proportions from counts. Let ni denote the number of counts observed for taxon i, and assume that each taxon's read count was sampled from a Poisson process with rate μi, i.e., ni ~ Poisson(μi), with n = ∑ini. Then, the set of joint counts given the total read count has a multinomial distribution, i.e., {[n1, n2, …]| n} ~ Multinomial (p1, p2, …| n), where each pi = μi/∑kμk, based on the equivalence between Poisson and multinomial processes.

Since the Poisson process is equivalent to the multinomial process, traditional methods use ni to estimate μi and then use the set of μi to estimate pi. These methods ignore that most datasets of this type contain large numbers of taxa with zero or small read counts; thus, the maximum-likelihood estimate of pi obtained this way is often exponentially inaccurate. Therefore, ALDEx estimates the set of proportions pi directly from the set of counts ni. ALDEx uses standard Bayesian techniques to infer the posterior distribution of [p1, p2, …] as the product of the multinomial likelihood with a Dirichlet ($$ \frac{1}{2} $$, $$ \frac{1}{2} $$,…) prior. Considering the large variance and extreme non-normality of the marginal distributions of pi when the associated ni are small, ALDEx does not summarize the posterior of pi using point estimates. Instead, it performs all inferences using the full posterior distribution of probabilities drawn from the Dirichlet distribution such that [p1, p2, …] ~ Dirichlet $$ \left(\left[{n}_1,{n}_2,\dots \right]+\frac{1}{2}\right) $$. By adding 0.5 to each count, this multivariate distribution avoids the zero problem for the inferred proportions even if the associated count is zero, and it conserves probability, i.e., ∑kpk = 1.

Because log-proportions are mathematically easy to manipulate, ALDEx takes the component-wise logarithms and subtracts the constant $$ \frac{1}{m}{\sum}_k\log \left({p}_k\right) $$ from each log-proportion component for a set of m proportions [p1, p2, …, pm]. This results in the relative abundance values $$ {q}_i=\log \left({p}_i\right)-\frac{1}{m}{\sum}_{k=1}^m\log \left({p}_k\right) $$, where ∑kqk is always zero. Most importantly, it projects q onto an (m − 1)-dimensional Euclidean vector space with linearly independent components. Thus, a traditional ANOVA-like framework can be formed to analyze the q values [q1, q2, …, qm].

Because it is cumbersome to compute the q distribution directly, the software estimates the distribution of q from multiple Monte Carlo realizations of p given [n1, n2, …nm]. MC sampling is based on the compositionally valid Dirichlet distribution with the addition of a uniform prior. That is, it repeats the MC sampling K times (K = 128 by default), consequently creating randomized instances of the dataset, in which for each taxon i in sample j, the observed abundance Oij is represented by a vector of MC samples of relative abundances $$ {\left({r}_{ij}^{(1)},\dots, {r}_{ij}^{(k)}\right)}^T $$. This renders the data free of zeros.
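Part 1 can be sketched in Python (ALDEx2 itself is an R package; this is only a toy illustration). A Dirichlet draw is obtained from independent Gamma draws, and the +0.5 prior guarantees strictly positive proportions even for zero counts:

```python
import random

def dirichlet_instance(counts, prior=0.5):
    """One Monte Carlo realization of proportions drawn from
    Dirichlet(counts + prior), via normalized Gamma draws."""
    draws = [random.gammavariate(c + prior, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]

def mc_instances(counts, k=128):
    """K randomized instances for one sample (ALDEx2 default K = 128)."""
    return [dirichlet_instance(counts) for _ in range(k)]

random.seed(2023)
sample_counts = [0, 5, 20, 75]       # note the zero count
instances = mc_instances(sample_counts)
print(len(instances))                                   # 128
print(all(p > 0 for inst in instances for p in inst))   # zero-free
```

Each instance sums to one, so the subsequent log-ratio transformation of Part 2 can be applied to every instance without zero-replacement.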
  • Part 2: Perform a log-ratio transformation on each of the Monte Carlo (MC) instances, usually through the clr- or iqlr-transformation

After obtaining the multivariate Dirichlet proportional distributions, to make meaningful comparisons of between-sample values from proportional distributions, ALDEx uses the procedures developed by Aitchison, Egozcue, and others (Aitchison and Egozcue 2005; Egozcue and Pawlowsky-Glahn 2005; Egozcue et al. 2003) to transform component proportions into linearly independent components. That is, it transforms the relative abundance vector $$ {\left({r}_{ij}^{(1)},\dots, {r}_{ij}^{(K)}\right)}^T $$ for each sample j and each MC Dirichlet realization k, k = 1, …, K. The distributions are estimated from multiple independent Monte Carlo realizations of their underlying Dirichlet-distributed proportions for all taxa (genes) i simultaneously.
  • Part 3: Perform statistical hypothesis testing on each MC instance, i.e., on each taxon in the vector of clr- or iqlr-transformed values, to generate a P-value (P) for each feature

The hypothesis testing is performed using classical statistical tests, i.e., Welch’s t-test and the Wilcoxon rank sum test for two groups, and the glm and Kruskal-Wallis tests for two or more groups. Since there is a total of K MC Dirichlet samples, each taxon will have K P-values.

Let i = {1, 2, …, I} index taxa (genes), j = {1, 2, …, J} index the groups (conditions), and k = {1, 2, …, Kj} index the replicates of a given group (condition). Using the framework of random-effects ANOVA models, the ALDEx model is given as below.

$$ {q}_{ij k}={\mu}_{ij}+{\nu}_{ij k}+{\tau}_{ij k}+{\varepsilon}_{ij k}, $$
(14.5)
where qijk is the adjusted log-abundance (expression); μij is the expected abundance (expression) of taxon i (gene i) within each group (condition) j; νijk is the sample-specific abundance (expression) change for replicate k; τijk is the sampling variation from inferring abundance (expression) from read counts; and εijk is the remaining nonspecific error.

Here, νijk is assumed to be approximately normal as in the usual ANOVA models. The sampling error τijk is given by the adjusted log-marginal distributions of the Dirichlet posterior, and its distribution is very Gaussian-like. To make the model more appropriate for the analysis of high-throughput sequencing datasets, ALDEx does not assume that within-group (condition) sample-to-sample variation is small and essentially negligible.

Under the ANOVA framework, ALDEx tests, for all taxa i (genes i), the null hypothesis of no difference between groups (conditions) j and j′: $$ {H}_0:{\mu}_{ij}={\mu}_{i{j}^{\prime }} $$
  • Part 4: Average these P-values across all MC instances to yield expected P-values for each taxon.

The expected P-values are then adjusted for multiple testing using the Benjamini-Hochberg (BH) correction procedure (Benjamini and Hochberg 1995).
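The averaging across MC instances can be sketched in base R with hypothetical raw P-values (not ALDEx2 code): each MC instance yields one column of P-values, each column is BH-adjusted with p.adjust(), and the expected values are the row means.

```r
set.seed(1)
K <- 4; m <- 10                              # K MC instances, m taxa (toy sizes)
pmat <- matrix(runif(K * m), nrow = m)       # hypothetical per-instance P-values
padj <- apply(pmat, 2, p.adjust, method = "BH")  # BH within each instance
expected_p  <- rowMeans(pmat)                # expected P-value per taxon
expected_bh <- rowMeans(padj)                # expected BH-adjusted P-value
```

The expected BH values reported by aldex.ttest() (we.eBH, wi.eBH) follow this average-over-instances logic.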

  • Part 5: Develop an estimated effect size metric to compare the predicted between-group differences to within-group differences.

The authors of this model (Fernandes et al. 2013) emphasize that statistical significance in the hypothesis testing of Part 3 above does not imply that the groups (conditions) j and j′ are meaningfully different. Instead, such meaning can be inferred through an estimated effect size that compares predicted between-group (condition) differences to within-group (condition) differences.

Given the set of random variables pijk and qijk, the within-group (condition) distribution is:

$$ W\left(i,j\right)=\sum \limits_{k=1}^{K_j}{q}_{ijk}, $$
(14.6)
the absolute fold difference between-group (condition) distribution is:
$$ {\Delta}_A\left(i,j,{j}^{\prime}\right)=W\left(i,j\right)-W\left(i,{j}^{\prime}\right), $$
(14.7)
the between-sample, within-group (condition) difference is:
$$ {\Delta}_W\left(i,j\right)=\underset{k\ne {k}^{\prime }}{\max}\mid {q}_{ij k}-{q}_{ij{k}^{\prime }}\mid, $$
(14.8)
and the relative effect size:
$$ {\Delta}_R\left(i,j,{j}^{\prime}\right)={\Delta}_A\left(i,j,{j}^{\prime}\right)/\max \left\{{\Delta}_W\left(i,j\right),{\Delta}_W\left(i,{j}^{\prime}\right)\right\}. $$
(14.9)
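Eqs. (14.6)–(14.9) can be computed directly; here is a base-R sketch for a single taxon and two groups, using hypothetical adjusted log-abundance values:

```r
q_j  <- c(2.1, 1.8, 2.4)   # hypothetical q values, replicates of group j
q_jp <- c(0.9, 1.2, 0.7)   # hypothetical q values, replicates of group j'
delta_A <- sum(q_j) - sum(q_jp)                          # Eq. (14.7) via Eq. (14.6)
max_pair_diff <- function(q) max(abs(outer(q, q, "-")))  # Eq. (14.8)
delta_R <- delta_A / max(max_pair_diff(q_j),
                         max_pair_diff(q_jp))            # Eq. (14.9)
```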

The effect size metric in ALDEx2 is another statistical method that was developed specifically for this package and is recommended for use over the P-values.

For normally distributed data, effect sizes usually measure the standardized mean difference between groups; for example, Cohen’s d (1988) measures the standardized difference between two means. Cohen’s d has been widely used in the behavioral sciences to accompany the reporting of t-test and ANOVA results and in meta-analysis. However, high-throughput sequencing data are often not normally distributed, and for non-normal data it is very difficult to measure and interpret effect sizes.

The ALDEx2 effect size measure estimates the median standardized difference between groups; it is a standardized distributional effect size metric. To obtain this effect size, first take the median of the between-group differences, which provides a non-parametric measure of the between-group difference. Then scale (normalize) the differences by the dispersion: effect = median(difference / dispersion), where dispersion = maximum(vector a − randomly permuted vector a, vector b − randomly permuted vector b). In the aldex.effect module output, these measures are denoted as diff.btw = median(diff), diff.win = median(disp), and effect = eff. ALDEx2 also provides a 95% confidence interval of the effect size estimate if CI=TRUE is specified in the aldex.effect() function (see Sect. 14.2.2).
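This median-of-scaled-differences calculation can be mimicked in base R (a hypothetical sketch, not the aldex.effect implementation), with a and b holding clr values for the two groups, one entry per MC instance:

```r
set.seed(2)
a <- rnorm(128, mean = 1)            # hypothetical clr values, group A
b <- rnorm(128, mean = 0)            # hypothetical clr values, group B
diff <- a - b                        # between-group differences
disp <- pmax(abs(a - sample(a)),     # within-group dispersion via
             abs(b - sample(b)))     # random self-permutation
diff.btw <- median(diff)             # analogous to diff.btw
diff.win <- median(disp)             # analogous to diff.win
effect   <- median(diff / disp)      # analogous to effect
```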

It was shown that the ALDEx2 effect size is robust, with Cohen’s d being approximately 1.4 times the ALDEx2 effect size, as expected for a well-behaved non-parametric estimator. It is equally valid for normal, random uniform, and Cauchy distributions (Fernandes et al. 2018).

14.2.2 Implement ALDEx2 Using R

ALDEx2 takes the compositional data analysis approach, using Bayesian methods to infer technical and statistical error (Fernandes et al. 2014). ALDEx2 incorporates a Bayesian estimate of the posterior probability of taxon abundance into a compositional framework; that is, it uses a Dirichlet distribution to transform the observed data and then estimates the distribution of taxon abundance by random sampling instances of the transformed data. There are two essential procedures: ALDEx2 first takes the original input data and generates a distribution of posterior probabilities of observing each taxon; then it uses the centered log-ratio transformation to transform this distribution. After the centered log-ratio transformation, a parametric or nonparametric t-test or ANOVA can be performed for the univariate statistical tests, and the P-values and Benjamini-Hochberg adjusted P-values are returned.

The ALDEx2 methods are available from the ALDEx2 package (current version 1.27.0, October 2021). There are three approaches to implementing the ALDEx2 methods: (1) the Bioconductor ALDEx2 modular approach, (2) the ALDEx2 wrapper, and (3) the aldex.glm module. The Bioconductor version of ALDEx2 is modular: it exposes the underlying centered log-ratio transformed Dirichlet Monte Carlo replicate values and hence is flexible, allowing users to add their own R commands based on their experimental design. Thus, it is suitable for the comparison of many different experimental designs. Currently, the ALDEx2 wrapper is limited to a two-sample t-test (and calculation of effect sizes) and a one-way ANOVA design. The aldex.glm module was developed to implement the probabilistic compositional approach for complex study designs.

We illustrated this package in our previous book (Xia et al. 2018a). In this section, we illustrate these three approaches in turn with a new microbiome dataset.

Example 14.1: Breast cancer QTRT1 Mouse Gut Microbiome

This example dataset was introduced in Chap. 9 (Example 9.1). In this dataset, mouse fecal samples were collected at two time points: pre-treatment and post-treatment. Here we are interested in comparing the microbiome difference between genotypes (breast cancer MCF-7 wild-type (WT) and MCF-7 knockout (KO) cell lines) at post-treatment.

There are two ways to install the ALDEx2 package. The most recent version of ALDEx2 is available from github.com/ggloor/ALDEx_bioc. It is recommended to run the most up-to-date R and Bioconductor version of ALDEx2. Here we install the stable version of ALDEx2 from Bioconductor.

# Install stable version from Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ALDEx2")
Run the ALDEX Modular Step-by-Step

The ALDEx2 modular approach allows users to specify their own tests because it exposes the underlying intermediate data. In its simplest form, the modular approach just calls the aldex.clr, aldex.ttest, and aldex.effect modules (functions) in turn and then merges the results into one data frame. We use the following seven steps to perform ALDEx2 using the modular approach.

  • Step 1: Load OTU-table and sample metadata.

> setwd("~/Documents/QIIME2R/Ch14_Compositional/ALDEx2")
> otu_tab<-read.csv("otu_table_L7_MCF7_Post.csv",row.names = 1)
> meta_tab<-read.csv("metadata_QtRNA_Post.csv",row.names = 1)
Here, we want to compare the post-treatment samples between genotypes (WT and KO). The above datasets are the subsets of the post-treatment groups.
> Genotype<-meta_tab[, c('Genotype')]
> Genotype
[1] "KO" "WT" "WT" "KO" "KO" "WT" "WT" "WT" "WT" "KO" "KO" "KO" "WT" "KO" "WT"
[16] "KO" "WT" "KO" "KO" "WT"
> Group<-meta_tab[, c('Group')]
> Group
[1] 1 0 0 1 1 0 0 0 0 1 1 1 0 1 0 1 0 1 1 0
  • Step 2: Run the aldex.clr module to generate the random centered log-ratio transformed values.

> library(ALDEx2)
> aldex_clr <- aldex.clr(otu_tab, Genotype, mc.samples=128 , denom="all", verbose=F)
where the argument:
  • A counts table (OTU abundance table), a vector of group labels, and the number of Monte Carlo samples are the three required inputs.

  • denom is used to specify a string indicating whether iqlr, zero, or all features are used as the denominator.

  • verbose is used to specify the level of verbosity (TRUE or FALSE).

ALDEx2 recommends 128 or more mc.samples for the t-test, 1000 for a rigorous effect size calculation, and at least 16 for ANOVA.
  • Step 3: Run the aldex.ttest module to perform the Welch’s t and Wilcoxon rank sum test.

Welch’s t-test and Wilcoxon rank sum tests are used to conduct statistical testing of two conditions or groups.
> aldex_tt <- aldex.ttest(aldex_clr, Genotype, paired.test=FALSE, verbose=FALSE)
where the argument:
  • aldex_clr is the aldex object from aldex.clr module.

  • paired.test is used to specify whether a paired test should be conducted or not (TRUE or FALSE).

The aldex.ttest() returns the values of we.ep (expected P-value of Welch’s t test), we.eBH (expected Benjamini-Hochberg corrected P-value of Welch’s t test), wi.ep (expected P-value of Wilcoxon rank test), and wi.eBH (expected Benjamini-Hochberg corrected P-value of Wilcoxon test).

As an alternative to the t-test, we can run the aldex.kw module to perform the Kruskal-Wallis and glm tests for one-way ANOVA, which compares two or more groups.

> aldex_kw <- aldex.kw(aldex_clr)
It returns the values of kw.ep (expected P-value of the Kruskal-Wallis test), kw.eBH (expected Benjamini-Hochberg corrected P-value of the Kruskal-Wallis test), glm.ep (expected P-value of the glm test), and glm.eBH (expected Benjamini-Hochberg corrected P-value of the glm test). This module is slow and has not been evaluated for the version used here.
  • Step 4: Run the aldex.effect module to estimate effect size and the within and between group values.

> aldex_effect <- aldex.effect(aldex_clr, CI=T, verbose=FALSE)
where the argument:
  • aldex_clr is the aldex object from aldex.clr module.

  • CI is used to indicate whether to include the 95% confidence interval information for the effect size estimate (TRUE or FALSE).

  • verbose is used to specify the level of verbosity.

This step is required for plotting and is performed in the case of two groups. The aldex.effect module returns the following values:
  • rab.all (median clr value for all samples in the feature).

  • rab.win.KO (median clr value for the KO group of samples).

  • rab.win.WT (median clr value for the WT group of samples).

  • diff.btw (median difference in clr values between the KO and WT groups).

  • diff.win (median of the largest difference in clr values within the KO and WT groups).

  • effect (median effect size: diff.btw / max(diff.win)), with effect.low and effect.high for all instances.

  • overlap (proportion of effect size that overlaps between the Bayesian distribution of groups KO and WT; i.e., if the overlap is 0: no effect).

  • Step 5: Merge all data into one object and make a data frame for result viewing and downstream analysis.

The following commands merge the t-test and effect data into one object and name it as aldex_all.
> aldex_all <- data.frame(aldex_tt, aldex_effect)
The following head() function displays the first three rows of the data.
> head(aldex_all,3)
we.ep we.eBH wi.ep wi.eBH rab.all rab.win.KO rab.win.WT
OTU_2 0.5406304 0.8283945 0.5739729 0.8502291 -1.328501 -1.536681 -1.1439323
OTU_3 0.4959608 0.8083920 0.5036120 0.8216989 -1.150645 -1.359854 -0.8576377
OTU_4 0.4926840 0.7968770 0.5281530 0.8168477 -1.243212 -1.427288 -1.0419705
diff.btw diff.win effect effect.low effect.high overlap
OTU_2 0.4165297 4.331159 0.08838384 -3.629842 4.652759 0.4515625
OTU_3 0.3566450 4.210935 0.07741749 -3.017144 5.117261 0.4531250
OTU_4 0.3932742 4.087157 0.07952276 -3.702736 4.159538 0.4578125
  • Step 6: Run the aldex.plot module to generate the MA and MW (effect) plots.

  • The Bland-Altman plot (difference plot, or Tukey mean-difference plot), named after J. Martin Bland and Douglas G. Altman, is a data plotting method for analyzing the agreement between two different measures (Altman and Bland 1983; Martin Bland and Altman 1986; Bland and Altman 1999). The Bland-Altman method states that any two methods designed to measure the same property or parameter should not merely be highly correlated but should also agree sufficiently closely. ALDEx2 provides a Bland-Altman (MA) style plot to graphically compare the median log2 between-condition difference with the median log2 relative abundance.

  • Effect size in ALDEx2 is defined as a measure of the median ratio of the difference between groups (diff.btw) and the maximum difference within groups (diff.win, or dispersion). The effect size can be obtained with the aldex.effect module or by specifying the effect=TRUE argument in the aldex wrapper. The aldex.plot module plots the median between-group difference versus the median within-group difference to visualize differential abundance of the sample data; these are referred to as “effect size” plots in ALDEx2.

  • The following commands generate the MA and MW (effect) plots (Fig. 14.1).


Fig. 14.1

MA and MW (effect) plots of ALDEx2 output from the aldex.plot() function. The left panel is a Bland-Altman or MA plot that shows the relationship between (relative) abundance and difference. The right panel is an MW (effect) plot that shows the relationship between difference and dispersion. In both plots, red represents the statistically significant features that are differentially abundant at Q = 0.05; gray features are abundant, but not differentially abundant; black features are rare, but not differentially abundant. This function uses the combined output from the aldex.ttest() and aldex.effect() functions. The log-ratio abundance axis is the clr value for the feature

> # Figure 14.1
> par(mfrow=c(1,2))
> # Bland-Altman (MA) plot
> aldex.plot(aldex_all, type="MA", test="welch", cutoff=0.05, all.cex=0.7, called.cex=1.1,
+ rare.col="black", called.col="red")
> aldex.plot(aldex_all, type="MW", test="welch",cutoff=0.05, all.cex=0.7, called.cex=1.1,
+ rare.col="black", called.col="red")

ALDEx2 generates a posterior distribution of the probability of observing the count given the data collected. Importantly, this approach generates the 95% CI of the effect size. ALDEx2 uses a standardized effect size, similar to Cohen’s d metric. The effect size in ALDEx2 was shown to be more robust and more conservative, being approximately 0.7 of Cohen’s d when the data are normally distributed (based on Greg Gloor’s note in his ALDEx2 vignette, ANOVA-Like Differential Expression tool for high throughput sequencing data, October 27, 2021).

In general, the P-value is less robust than the effect size. Thus, many researchers prefer to report effect sizes rather than P-values. If the sample size is sufficiently large, an effect size of 0.5 or greater is considered more likely to correspond to biological relevance.

A few features have an expected Q-value that is statistically significant yet are relatively rare and have a relatively small difference. For such features, identification based even on the expected effect size can be misleading. ALDEx2 finds that the safest approach for identifying these features is to find where the 95% CI of the effect size does not cross 0. The 95% CI metric behaves exactly in line with intuition: the precision of estimation of rare features is poor. To identify rare features with more confidence, deeper sequencing is required. The authors of ALDEx2 think this 95% CI approach can identify the biological variation in the data as received (i.e., the experimental design is always as given). This is the approach that was used in Macklaim et al. (2013), and it was independently validated to be very robust (Nelson et al. 2015). In summary, this approach does not infer any additional biological variation but identifies those features for which simple random sampling of the library would be expected to give the same result every time.
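A sketch of this CI-based filter, using a toy data frame with the same column names as the aldex.effect output (effect.low and effect.high require CI=TRUE; the values here are hypothetical):

```r
eff <- data.frame(effect.low  = c(-3.6,  0.6, -12.7),   # hypothetical CI bounds
                  effect.high = c( 4.7, 25.7,  -0.2))
sig_ci <- which(eff$effect.low > 0 | eff$effect.high < 0)  # CI excludes zero
sig_ci                                                     # rows 2 and 3
```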

ALDEx2 considers an effect size cutoff of 1.5–2 and an overlap cutoff of 0.01 more appropriate for identifying differential taxa of interest (Fernandes et al. 2013). Below we illustrate two more effect size plots: (1) a plot of the effect size versus the P-value and (2) a volcano plot showing the difference between groups versus the P-value (Fig. 14.2).


Fig. 14.2

Relationship between effect size, difference, and P-values and BH-adjusted P-values in the tested dataset. This plot shows that the effect size has a much closer relationship to the P-value than does the raw difference

> # Figure 14.2
> # Effect size and volcano plots
> par(mfrow=c(1,2))
> plot(aldex_all$effect, aldex_all$wi.ep, log="y",pch=19, cex=0.7, col=rgb(0,0,1,0.2),
+ xlab="Effect size", ylab="Expected P-value of Wilcoxon rank sum test", main="Effect size plot")
> points(aldex_all$effect, aldex_all$wi.eBH,pch=19, cex=0.7, col=rgb(1,0,0,0.2))
> abline(h=0.05, lty=2,lwd=2, col="red")
> legend(-4,0.0001, legend=c("P-value", "BH-adjusted"), pch=10, col=c("blue", "red"))
> plot(aldex_all$diff.btw, aldex_all$wi.ep, log="y", pch=19,cex=0.7, col=rgb(0,0,1,0.2),
+ xlab="Difference", ylab="Expected P-value of Wilcoxon rank sum test", main="Volcano plot")
> points(aldex_all$diff.btw, aldex_all$wi.eBH, pch=19, cex=0.7, col=rgb(1,0,0,0.2))
> abline(h=0.05, lty=2, lwd=2,col="red")
  • Step 7: Identify significant features by both Welch’s t-test and Wilcoxon rank sum test.

> sig_by_both <- which(aldex_all$we.ep < 0.05 & aldex_all$wi.ep < 0.05)
> sig_by_both
[1] 50 53 66 68 74 75 78 81 82 196 202 215 218 224 227 230 232 238 239 243
[21] 245 259 261 263 266 278 288
> sig_by_both_fdr <- which(aldex_all$we.eBH < 0.05 & aldex_all$wi.eBH < 0.05)
> sig_by_both_fdr
[1] 50 66 68 74 75 78 82 202 230 245 261 263

Twenty-seven taxa are identified as significant by both Welch’s t-test and the Wilcoxon rank sum test, and twelve of these remain significant when the P-values are adjusted for multiple testing using the Benjamini-Hochberg method.

The following R commands use the xtable() function from the xtable package to make a results table. The xtable package is used to create export tables, converting an R object to an xtable object, which can then be printed as a LaTeX or HTML table. Here, the print.xtable() function is used to export an HTML file. If you want to export a LaTeX file, use type="latex", file="ALDEx2_Table_Coef_QtRNA.tex" instead.

> # Table 14.1
> # Write results table
> # Make the table
> library(xtable)
> table <-xtable(
+ aldex_all[sig_by_both,c(8:12,1,3,2,4)], caption="Table of significant OTUs", digits=3,
+ label="sig.table", align=c("l",rep("r",9) )
+ )
> print.xtable(table,type="html",file = "ALDEx2_Table_Coef_QtRNA.html")
> write.csv(table,file = paste("ALDEx2_Table_Coef_QtRNA.csv",sep = ""))
where the argument:
  • aldex_all[sig_by_both,c(8:12,1,3,2,4)] is an R object; the element “sig_by_both” selects the rows of the output matrix, and the element “c(8:12,1,3,2,4)” selects the columns of the output matrix in the order you want them to appear in the exported table.

  • caption is used to specify the table’s caption or title.

  • label is used to specify the LaTeX label or HTML anchor.

  • align indicates the alignment of the corresponding columns; it is a character vector with length equal to the number of columns of the resulting table; the resulting table has 9 columns, so the number is 9.

  • If the R object is a data.frame, the length of align is specified to be 1 + ncol(x) because the row names are printed in the first column.

  • The left, right, and center alignment of each column are denoted by “l,” “r,” and “c,” respectively. In this table, align=c("l", rep("r",9)) indicates that the first column is aligned left and the remaining 9 columns are aligned right. The digits argument is used to specify the number of digits to display in the corresponding columns (Table 14.1).

Table 14.1

The significant features identified by both Welch’s t-test and Wilcoxon rank sum test with P-values and BH-adjusted P-values in the tested dataset

         diff.btw  diff.win  effect  effect.low  effect.high  we.ep  wi.ep  we.eBH  wi.eBH
OTU_98     −4.698     3.539  −1.269      −9.010        0.347  0.002  0.001   0.033   0.025
OTU_102     1.253     1.333   0.854      −0.993        7.335  0.010  0.014   0.144   0.160
OTU_116   −10.435     2.831  −3.563     −23.145       −0.797  0.000  0.000   0.000   0.001
OTU_119    −3.402     1.443  −2.071     −16.377       −0.173  0.000  0.000   0.002   0.001
OTU_125    −9.904     2.786  −3.558     −26.823       −0.717  0.000  0.000   0.001   0.001
OTU_126   −11.171     2.684  −4.260     −33.527       −0.944  0.000  0.000   0.000   0.001
OTU_131     3.671     1.419   2.414      −0.834       18.509  0.000  0.001   0.002   0.015
OTU_134     3.560     2.865   1.137      −0.848       12.261  0.012  0.001   0.113   0.029
OTU_135     2.528     1.235   1.938       0.198       12.070  0.000  0.000   0.003   0.002
OTU_316     4.763     3.928   1.124      −1.458       10.936  0.015  0.005   0.119   0.064
OTU_323    −7.343     3.637  −2.004     −19.881       −0.072  0.000  0.000   0.003   0.003
OTU_342     0.639     0.830   0.725      −1.801        6.394  0.041  0.043   0.332   0.322
OTU_345     0.819     0.730   1.077      −0.969        7.081  0.004  0.006   0.072   0.086
OTU_354     1.882     2.367   0.676      −1.596       10.338  0.036  0.019   0.301   0.192
OTU_359     2.107     1.420   1.244      −1.761       12.310  0.002  0.005   0.043   0.081
OTU_362     2.818     1.069   2.474       0.351       14.778  0.000  0.000   0.000   0.001
OTU_365     4.726     3.273   1.280      −0.690       11.492  0.006  0.001   0.066   0.020
OTU_371     0.889     0.909   0.820      −0.746       11.161  0.013  0.005   0.177   0.080
OTU_373     2.223     2.327   0.831      −1.384       10.228  0.021  0.021   0.220   0.201
OTU_380     1.832     2.369   0.700      −0.694       16.919  0.031  0.003   0.294   0.057
OTU_385     0.513     0.397   1.265      −0.371        7.901  0.001  0.001   0.019   0.025
OTU_405     0.505     0.653   0.753      −1.231        7.099  0.019  0.022   0.206   0.206
OTU_408     8.446     3.040   2.826       0.644       25.680  0.000  0.000   0.001   0.001
OTU_410    −1.393     0.776  −1.742     −12.694       −0.161  0.000  0.000   0.002   0.002
OTU_413     0.518     0.636   0.808      −1.278        6.340  0.022  0.027   0.229   0.243
OTU_428     0.808     1.049   0.719      −1.418        7.025  0.043  0.049   0.340   0.344
OTU_448    −1.122     1.311  −0.887      −7.396        1.385  0.010  0.011   0.144   0.142

Only the significant taxa detected by both Welch’s t-test and the Wilcoxon rank sum test are printed in Table 14.1. We can interpret the table this way: for OTU_98, the absolute difference between the KO and WT groups can be up to −4.698, implying that the absolute fold change in the ratio between OTU_98 and all other taxa between the KO and WT groups is on average (1/2)−4.698 = 25.96-fold across samples. The difference within the groups of 3.539 is roughly equivalent to the standard deviation, giving an effect size of −4.698/3.539 = −1.327 (here exactly: −1.269 [−9.010, 0.347]).
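The arithmetic behind this interpretation can be checked directly in R:

```r
# Fold change implied by a clr difference of -4.698 (log2 scale):
# (1/2)^(-4.698) = 2^4.698
2^4.698            # ~25.96-fold
# Approximate standardized effect: between-group / within-group difference
-4.698 / 3.539     # ~ -1.33, close to the reported effect of -1.269
```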

Run the ALDEX Wrapper

The aldex wrapper links the modular elements together to emulate ALDEx2 as it existed prior to the modular approach. In the simplest case, the aldex wrapper performs a two-sample t-test and calculates effect sizes.

> aldex_t <- aldex(otu_tab, Genotype, mc.samples=128, test="t", effect=TRUE,
+ include.sample.summary=FALSE, denom="all", verbose=FALSE)
Here the test group ‘Genotype’ has two levels (KO vs. WT), so this is a two-sample t-test. We specify test = “t”, and then effect should be set to TRUE. The “t” option evaluates the data as a two-factor experiment using both Welch’s t-test and the Wilcoxon rank sum test. Like the other tests, the t-tests also include a Benjamini-Hochberg correction of the raw P-values. The data can also be plotted onto Bland-Altman (MA) or effect (MW) plots (Fig. 14.3).


Fig. 14.3

MA and MW (effect) plots from t-tests in ALDEx2 output

The inter-quartile log-ratio (iqlr) transformation was introduced in the ALDEx2 package (see Sect. 14.1.6.4). The following commands use iqlr transformation and generate MA and MW plots.

> aldex_t_iqlr <- aldex(otu_tab, Genotype, mc.samples=128, test="t", effect=TRUE,
+ include.sample.summary=FALSE, denom="iqlr", verbose=FALSE)
> head(aldex_t_iqlr,3)
rab.all rab.win.KO rab.win.WT diff.btw diff.win effect overlap we.ep
OTU_2 -0.7688690 -0.9230157 -0.6292115 0.4898984 4.143746 0.09669726 0.4531250 0.5048162
OTU_3 -0.3587615 -0.5131136 -0.2026998 0.2097475 4.087856 0.04867811 0.4664587 0.5095202
OTU_4 -0.7754322 -0.9341565 -0.6243442 0.1912022 4.060661 0.03592633 0.4804992 0.4813148
we.eBH wi.ep wi.eBH
OTU_2 0.8170884 0.5300930 0.8402394
OTU_3 0.8178020 0.5329214 0.8449501
OTU_4 0.8107567 0.5222389 0.8325645
> # Figure 14.3
> par(mfrow=c(1,2))
> aldex.plot(aldex_t_iqlr, type="MA", test="welch", cutoff=0.05, all.cex=0.7, called.cex=1.1,
+ rare.col="black", called.col="red", xlab="Log-ratio abundance",
+ ylab="Difference")
> aldex.plot(aldex_t_iqlr, type="MW", test="welch",cutoff=0.05, all.cex=0.7, called.cex=1.1,
+ rare.col="black", called.col="red", xlab="Dispersion",
+ ylab="Difference")

The left panel is a Bland-Altman (MA) plot that shows the relationship between (relative) abundance and difference. The right panel is an MW (effect) plot that shows the relationship between difference and dispersion. In both plots, red dots indicate features that are statistically significant, and gray or black dots indicate features that are not significant. The log-ratio abundance axis is the clr value for the feature.

Run the ALDEX.GLM Module Using Complex Study Designs

ALDEx2 also has the aldex.glm module that can be used to implement the probabilistic compositional approach for complex study designs. This module is substantially slower than the above two-group comparison tests; however, users can implement their own study designs. Essentially, this approach is the same as the modular approach above but requires the users to provide a model matrix and covariates to the glm() function in R.

Example 14.2: Breast Cancer QTRT1 Mouse Gut Microbiome, Example 14.1, Cont.

Here, we use the full dataset from Example 14.1 to illustrate the aldex.glm module. In this dataset, there are two main-effect variables, Group and Time. We are interested in comparing the microbiome differences between Group (coded as 1 for KO and 0 for WT), Time (coded as Before and Post), and their interaction term.

  • Step 1: Load the count table and metadata.

> setwd("~/Documents/QIIME2R/Ch14_Compositional/ALDEx2")
> otu_table<-read.csv("otu_table_L7_MCF7_phyloseq.csv",row.names = 1)
> meta_table<-read.csv("metadata_QtRNA.csv",row.names = 1)
> head(meta_table,3)
MouseID Genotype Group Time Group4 Total.Read
Sun071.PG1 PG1 KO 1 Post KO_POST 61851
Sun027.BF2 BF2 WT 0 Before WT_BEFORE 42738
Sun066.PF1 PF1 WT 0 Post WT_POST 54043
  • Step 2: Set model matrix and covariates.

> Group<-meta_table[, c('Group')]
> Time<-meta_table[, c('Time')]
> TotalRead <- meta_table[, c('Total.Read')]
> covariates <- data.frame(Group, Time, TotalRead)
> mod <- model.matrix(~ Group*Time, covariates)
  • Step 3: Run the aldex.glm() function.

> aldex_clr_glm <- aldex.clr(otu_table, mod, mc.samples=128, denom="all", verbose=F)
> aldex_glm <- aldex.glm(aldex_clr_glm, mod)
  • Step 4: Run the aldex.glm.effect () function.

> aldex_glm_effect <- aldex.glm.effect(aldex_clr_glm)
The aldex.glm.effect() function calculates the effect size for each binary predictor in the matrix and outputs a named list.
  • Step 5: Plot the effect sizes.

Figure 14.4 can be generated using the following R commands, which plot the BH-corrected glm values for the actual test case vs. the effect size for the binary predictor.


Fig. 14.4

Plot of effect sizes for Group by Time interaction at post-treatment in breast cancer QTRT1 mouse gut microbiome

> # Figure 14.4
> aldex.plot(aldex_glm_effect[["Group:TimePost"]], test="effect", cutoff=2)
> sig <- aldex_glm[,20] < 0.05
> points(aldex_glm_effect[["Group:TimePost"]]$diff.win[sig],
+ aldex_glm_effect[["Group:TimePost"]]$diff.btw[sig], col="blue")
> sig <- aldex_glm[,20] < 0.2
> points(aldex_glm_effect[["Group:TimePost"]]$diff.win[sig],
+ aldex_glm_effect[["Group:TimePost"]]$diff.btw[sig], col="blue")
  • Step 6: Write out the glm test and effect table.

  • First, merge glm and glm effect tests into one object and make a data frame.

The following commands merge the aldex_glm and aldex_glm_effect data into one object and name it as aldex_all_glm (Table 14.2).
> aldex_all_glm <- data.frame(aldex_glm, aldex_glm_effect)
> sig_glm <- which(aldex_all_glm[,20] < 0.05) # BH-corrected P-values of Group.TimePost
Table 14.2

The significant features identified by the glm test with BH-adjusted P-values in the tested dataset

         diff.btw  diff.win  effect  overlap  BH-adjusted P
OTU_116    10.770     2.999   3.678    0.000          0.027
OTU_119     3.022     1.552   1.787    0.031          0.012
OTU_125     9.862     2.830   3.480    0.000          0.016
OTU_126    11.211     2.707   4.076    0.000          0.047
OTU_362    −3.225     1.255  −2.480    0.002          0.013

All columns refer to the Group:TimePost interaction term (the Group.TimePost.* columns of the merged output).

> # Table 14.2
> # Write effect table
> # Make the table
> library(xtable)
> table <-xtable(
+ aldex_all_glm[sig_glm,c(38:41,20)], caption="Table of significant OTUs", digits=3,
+ label="sig.table", align=c("l",rep("r",5) ))
> print.xtable(table,type="html",file = "ALDEx2_Table2_Coef_QtRNA_glm.html")
> write.csv(table,file = paste("ALDEx2_Table2_Coef_QtRNA_glm.csv",sep = ""))

14.2.3 Remarks on ALDEx2

ALDEx2 (ALDEx) was designed to identify genes that are differentially expressed, relative to the geometric mean abundance, between two or more groups. Fernandes et al. (2013) showed that the ALDEx2 methods are robust compared to existing representative RNA-seq methods such as edgeR (Robinson et al. 2010) and DESeq (Anders and Huber 2010). ALDEx2 was also demonstrated to have very high precision in identifying differentially expressed genes (and transcripts) for 16S rRNA data, as well as high recall given sufficient sample sizes (Quinn et al. 2018a, b). Although ALDEx2 has been shown to generalize to any type of high-throughput sequencing data and is often used by metagenomics researchers to identify differentially abundant taxa (OTUs) (Urbaniak et al. 2016), it has several limitations, for example:
  • ALDEx2 cannot address the zero-inflation problem; instead, it replaces zero read counts by sampling from a valid Dirichlet distribution (Fernandes et al. 2013).

  • ALDEx2 statistical tests are hard to interpret unless the log-ratio transformation sufficiently approximates an unchanged reference (Quinn et al. 2018a, b). Performance on this point depends on the transformation; it was shown that the inter-quartile range log-ratio (iqlr) transformation outperforms the centered log-ratio (clr) transformation, and the iqlr transformation was recommended as the default setting for ALDEx2 (Quinn et al. 2018a, b).

  • ALDEx2 requires a large number of samples because it performs statistical hypothesis testing using non-parametric methods, such as the Wilcoxon rank-sum test for comparisons of two groups and the Kruskal-Wallis test for comparisons of more than two groups. Non-parametric differential expression methods have been suggested to reduce statistical power and hence to require large sample sizes (Quinn et al. 2018a, b; Seyednasrollah et al. 2013; Tarazona et al. 2015; Williams et al. 2017). It has been suggested that, partially due to its non-parametric nature and these weaknesses, the ALDEx2 package has not been widely adopted in the analysis of RNA-seq data (Quinn et al. 2018a, b).

  • ALDEx2 interprets the log-ratio transformation as a normalization and is therefore largely limited by this normalization. Even with large sample sizes, the results of the statistical tests are hard to interpret when the log-ratio transformation does not sufficiently approximate an unchanged reference (Quinn et al. 2018a, b).

ALDEx2 has been reported to have difficulty controlling the FDR and maintaining statistical power compared to competing differential abundance methods (e.g., ANCOM, ANCOM-BC, edgeR, and DESeq2) (Lin and Peddada 2020b; Morton et al. 2019).

14.3 Analysis of Composition of Microbiomes (ANCOM)

Like ALDEx2, ANCOM was developed under the statistical framework of ANOVA; it uses non-parametric testing to analyze relative abundances through a log-ratio transformation of the observed counts.

14.3.1 Introduction to ANCOM

ANCOM (Mandal et al. 2015), an alr (additive log-ratio)-based method, was proposed based on the alr-transformation to account for the compositional structure of microbiome data. ANCOM repeatedly applies the alr-transformation, taking each taxon in the dataset as the reference taxon in turn. Thus, given a total of m taxa, ANCOM chooses each of the m taxa as the reference once and fits m − 1 regressions for each choice, so that a total of m(m − 1) regression models are fitted.

Like an ANOVA model, ANCOM models the log-ratios of OTU abundances in each sample with a linear model; but, unlike a typical ANOVA, ANCOM accommodates the dependencies and correlations among the relative abundances of the OTUs by using log-ratios. That is, ANCOM lies within the framework of ANOVA while accounting for the compositional structure of microbiome data. The dependencies, and hence the compositional structure, arise in microbiome data because the relative abundances of OTUs sum to 1 (or 100%) in each sample and because the relative abundances of different OTUs may be positively or negatively correlated.
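The closure-induced dependency can be seen in a short simulation (a generic Python sketch for illustration only, not code from any of the tools discussed here): even when the absolute abundances of taxa are generated independently, closing each sample to proportions induces negative correlation.

```python
import random

# Simulate absolute abundances of 3 independent taxa across many samples,
# then close each sample to proportions (each row sums to 1).
random.seed(1)
n = 2000
abs_counts = [[random.gammavariate(5, 10) for _ in range(3)] for _ in range(n)]
rel = [[x / sum(row) for x in row] for row in abs_counts]

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

taxon1 = [row[0] for row in rel]
taxon2 = [row[1] for row in rel]
# Negative, although the absolute abundances were independent
print(corr(taxon1, taxon2))
```

With three symmetric taxa the induced correlation is close to −0.5, which is why correlation-based reasoning on raw proportions is unreliable and log-ratios are used instead.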

ANCOM relies on the following two assumptions: (1) within the ecosystem, the mean log absolute abundances of two taxa are not different, and (2) within the ecosystem, the mean log absolute abundances of all taxa do not change by the same amount between two (or more) study groups (Mandal et al. 2015). Putting (1) and (2) together, ANCOM assumes that differentially abundant taxa can be identified through the inter-taxa log-ratio transformation. We summarize the methods of ANCOM in three main parts below.
  • Part 1: Develop a statistical model under the framework of ANOVA to perform compositional differential analysis of each taxon.

Under these two assumptions, for the ith taxon and jth sample, ANCOM developed a statistical model using a standard ANOVA model formulation to perform all possible DA analyses by successively using each taxon as a reference taxon.

$$ \log \left(\frac{r_{ij}^{(g)}}{r_{i^{\prime }j}^{(g)}}\right)={\alpha}_{i{i}^{\prime }}+{\beta}_{i{i}^{\prime}}^{(g)}+\sum \limits_k{X}_{jk}{\beta}_{i{i}^{\prime }k}+{\varepsilon}_{i{i}^{\prime }j}^{(g)}, $$
(14.10)
where i′ is the reference taxon, i ≠ i′, i = 1, 2, …, m, and g = 1, 2, …, G indexes the study groups.
  • Part 2: Conduct a null hypothesis regarding mean log absolute abundance in a unit volume of an ecosystem using relative abundances.

It was shown (Mandal et al. 2015) that by combining these two assumptions, it can conduct a null hypothesis regarding mean log absolute abundance in a unit volume of an ecosystem using relative abundances. And to test whether a taxon i is differentially abundant regarding a factor of interest with G levels is equivalent to test the null hypothesis and the alternative hypothesis (Lin and Peddada 2020b):

$$ {H}_{0\left(i{i}^{\prime}\right)}:{\beta}_{i{i}^{\prime}}^{(1)}=\cdots ={\beta}_{i{i}^{\prime}}^{(G)}=0,\kern0.5em \mathrm{and}\kern0.5em {H}_{a\left(i{i}^{\prime}\right)}:\mathrm{not}\ \mathrm{all}\kern0.5em {\beta}_{i{i}^{\prime}}^{(g)}=0,\kern0.5em \mathrm{for}\ \mathrm{every}\kern0.5em i\ne {i}^{\prime }. $$
  • Part 3: Adjust the P-values from the pairwise tests of taxa.

There are $$ \frac{m\left(m-1\right)}{2} $$ P-values from the $$ \frac{m\left(m-1\right)}{2} $$ pairwise tests of taxa. ANCOM uses the Benjamini-Hochberg (BH) procedure (Benjamini and Hochberg 1995) or the Bonferroni correction procedure (Dunn 1958, 1961) to adjust the P-values for multiple testing. For each taxon i, the number of rejections Wi is computed, and the empirical distribution of {W1, W2, …, Wm} determines the cutoff value for declaring a taxon significant (Mandal et al. 2015). A large value of Wi indicates that taxon i is more likely differentially abundant. However, the decision rule for choosing the cutoff value of Wi is somewhat arbitrary in ANCOM; the 70th percentile of the W distribution is recommended, but users can select a different threshold, such as the 60th to 90th percentiles, to output the statistical testing results.
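The counting scheme in Parts 1-3 can be sketched as follows (illustrative Python only, not the authors' implementation; a two-sample t statistic on the log-ratios stands in for ANCOM's non-parametric test, and the crude rule |t| > 2 stands in for a corrected P-value threshold):

```python
import math, random

random.seed(7)
m, n = 5, 30                       # number of taxa, samples per group
# Simulated positive abundances; in group "B", taxon 0 is shifted upward.
counts = {g: [[random.lognormvariate(2 + (1.5 if g == "B" and i == 0 else 0), 0.3)
               for _ in range(n)]
              for i in range(m)]
          for g in ("A", "B")}

def t_stat(xs, ys):
    """Two-sample (Welch) t statistic."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    vx = sum((x - mx) ** 2 for x in xs) / (len(xs) - 1)
    vy = sum((y - my) ** 2 for y in ys) / (len(ys) - 1)
    return (mx - my) / math.sqrt(vx / len(xs) + vy / len(ys))

# W_i = number of references i' for which log(taxon_i / taxon_i') differs
# significantly between the two groups.
W = [0] * m
for i in range(m):
    for r in range(m):             # r plays the role of the reference taxon i'
        if r == i:
            continue
        lr = {g: [math.log(counts[g][i][k] / counts[g][r][k]) for k in range(n)]
              for g in ("A", "B")}
        if abs(t_stat(lr["A"], lr["B"])) > 2.0:
            W[i] += 1
print(W)  # taxon 0 should have the largest W
```

Taxon 0, which truly differs between groups, rejects against nearly every reference, while null taxa reject mainly against taxon 0; ranking taxa by Wi therefore separates the differentially abundant taxa from the rest.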

14.3.2 Implement ANCOM Using QIIME 2

Example 14.3: Mouse Gut Microbiome Data, Example 9.​2 Cont.

These mouse gut microbiome datasets were generated by DADA2 within QIIME 2 from raw 16S rRNA sequencing data and have been used to illustrate statistical analyses of alpha and beta diversities, as well as other analyses and plots via QIIME 2, in Chaps. 9, 10, and 11. Here, we continue to use these datasets to illustrate the implementation of ANCOM using QIIME 2. We showed in Chap. 10 via the emperor plot that many features changed in abundance over time (Early and Late times). In this section, we illustrate how to test differential abundances with ANCOM using the same mouse gut data as in Example 9.2.

ANCOM can be implemented via R or QIIME 2. ANCOM does not yet have an R package; to run ANCOM in R, we need to download the ANCOM code (ancom_v2.1.R), store it under the working directory, and then run the following code: source("ancom_v2.1.R"). Here, we implement ANCOM via the q2-composition plugin. To perform ANCOM to identify features that are differentially abundant across groups, we call the qiime composition ancom command. This command requires two data files: (1) an artifact feature table (here, FeatureTableMiSeq_SOP.qza) and (2) a sample metadata file (here, SampleMetadataMiSeq_SOP.tsv). The artifact feature table is used for the ANCOM computation, while a categorical column in the sample metadata file must be specified for the test of differential abundance across groups. A visualization output file name is also required.
  • Step 1: Use the filter-samples command to filter the feature table.

ANCOM assumes that fewer than about 25% of the features change between groups. If this assumption is violated, ANCOM is more likely to inflate both Type I and Type II errors. As we showed in Chap. 10 via the emperor plot, many features changed in abundance over time (Early and Late times); here we filter the full feature table to contain only the late mouse gut samples and then perform ANCOM to identify which sequence variants and taxa are differentially abundant across the male and female mouse gut samples.

source activate qiime2-2022.2
cd QIIME2R-Biostatistics
cd ANCOM
# Filter out the samples with less than 1153 reads (see details in Chapter 9)
qiime feature-table filter-samples \
--i-table FeatureTableMiSeq_SOP.qza \
--m-metadata-file SampleMetadataMiSeq_SOP.tsv \
--p-where "[Time]='Late'" \
--p-min-frequency 1153 \
--o-filtered-table LateMouseGutTable.qza

Saved FeatureTable[Frequency] to: LateMouseGutTable.qza

# Filter features that only appear in a single sample
qiime feature-table filter-features \
--i-table LateMouseGutTable.qza \
--p-min-samples 2 \
--o-filtered-table LateMouseGutTable2.qza

Saved FeatureTable[Frequency] to: LateMouseGutTable2.qza

# Filter features that only appear 10 times or fewer across all samples
qiime feature-table filter-features \
--i-table LateMouseGutTable.qza \
--p-min-frequency 10 \
--o-filtered-table LateMouseGutTable3.qza

Saved FeatureTable[Frequency] to: LateMouseGutTable3.qza

  • Step 2: Use the taxa collapse command to collapse the feature table.

Suppose we want to analyze these data at the genus level (Level 6); then we can apply the following commands to collapse the feature table at the genus level.
# Collapse feature table at genus level
qiime taxa collapse \
--i-table LateMouseGutTable3.qza \
--i-taxonomy TaxonomyMiSeq_SOP.qza \
--p-level 6 \
--o-collapsed-table LateMouseGutTableL6.qza

Saved FeatureTable[Frequency] to: LateMouseGutTableL6.qza

If we want to analyze these data at the species level (Level 7), then we can apply the following commands to collapse the feature table.
# Collapse feature table at species level
qiime taxa collapse \
--i-table LateMouseGutTable3.qza \
--i-taxonomy TaxonomyMiSeq_SOP.qza \
--p-level 7 \
--o-collapsed-table LateMouseGutTableL7.qza

Saved FeatureTable[Frequency] to: LateMouseGutTableL7.qza

The following intermediate files can be removed from the directory using the rm command, e.g., rm LateMouseGutTable.qza. However, we keep them there for later review.

LateMouseGutTable.qza
LateMouseGutTable2.qza
LateMouseGutTable3.qza
  • Step 3: Use the add-pseudocount command to add a small pseudocount to produce the compositional feature table artifact.

As a compositional method, ANCOM cannot handle zeros because the log or log-ratio transformation is not defined for frequencies of zero. ANCOM operates on a FeatureTable[Composition] QIIME 2 artifact, which is based on the frequencies of features on a per-sample basis. To build the composition artifact (a FeatureTable[Composition] artifact), a small pseudocount must be added to the FeatureTable[Frequency] artifact, typically via an imputation method.
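The arithmetic behind the pseudocount can be sketched in a few lines (a generic Python illustration; the pseudocount of 1 mirrors the plugin's usual default, and the clr transform shown here is just one of the transformations applied after zero removal):

```python
import math

# One sample of raw feature counts; the zero makes log() undefined.
sample = [0, 5, 20, 75]
pseudo = [c + 1 for c in sample]   # add-pseudocount step (pseudocount = 1)

# Centered log-ratio: log of each count relative to the geometric mean.
gmean = math.exp(sum(math.log(c) for c in pseudo) / len(pseudo))
clr = [math.log(c / gmean) for c in pseudo]
print([round(v, 3) for v in clr])
print(round(sum(clr), 10))         # clr values sum to 0 by construction
```

Because the pseudocount is arbitrary, features with many zeros are the ones most sensitive to its choice, which is one root of the zero-handling criticisms discussed in the remarks below.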

# Produce the compositional feature table artifact
qiime composition add-pseudocount \
--i-table LateMouseGutTableL6.qza \
--o-composition-table CompLateMouseGutTableL6.qza

Saved FeatureTable[Composition] to: CompLateMouseGutTableL6.qza

# Produce the compositional feature table artifact
qiime composition add-pseudocount \
--i-table LateMouseGutTableL7.qza \
--o-composition-table CompLateMouseGutTableL7.qza

Saved FeatureTable[Composition] to: CompLateMouseGutTableL7.qza

  • Step 4: Perform ANCOM to identify differential features across the mouse gut sex groups.

# Perform ANCOM to identify differential features
qiime composition ancom \
--i-table CompLateMouseGutTableL6.qza \
--m-metadata-file SampleMetadataMiSeq_SOP.tsv \
--m-metadata-column Sex \
--o-visualization AncomSexMisSeq_SOPL6.qzv

Saved Visualization to: AncomSexMisSeq_SOPL6.qzv

# Perform ANCOM to identify differential features
qiime composition ancom \
--i-table CompLateMouseGutTableL7.qza \
--m-metadata-file SampleMetadataMiSeq_SOP.tsv \
--m-metadata-column Sex \
--o-visualization AncomSexMisSeq_SOPL7.qzv

Saved Visualization to: AncomSexMisSeq_SOPL7.qzv

We can view the plots using the qiime tools view command:
  • qiime tools view AncomSexMisSeq_SOPL7.qzv

This view command displays three results: (1) ANCOM Volcano Plot, (2) ANCOM statistical results, and (3) percentile abundances of features by group. All these results can be exported, downloaded, and saved.

Table 14.3 is a copy of the ANCOM statistical results. The first five taxa were identified as differentially abundant.
Table 14.3

Differentially abundant and non-differentially abundant taxa identified by ANCOM statistical testing

Kingdom | Phylum | Class | Order | Family | Genus | Species | W | Reject null hypothesis
k__Bacteria | p__Firmicutes | c__Bacilli | o__Lactobacillales | f__Lactobacillaceae | g__Lactobacillus | s__ | 48 | TRUE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | f__Clostridiaceae | __ | __ | 48 | TRUE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | f__Peptococcaceae | g__rc4-4 | s__ | 47 | TRUE
k__Bacteria | p__Actinobacteria | c__Actinobacteria | o__Bifidobacteriales | f__Bifidobacteriaceae | g__Bifidobacterium | s__pseudolongum | 44 | TRUE
k__Bacteria | p__Verrucomicrobia | c__Verrucomicrobiae | o__Verrucomicrobiales | f__Verrucomicrobiaceae | g__Akkermansia | s__muciniphila | 44 | TRUE
k__Bacteria | p__TM7 | c__TM7-3 | o__CW040 | f__F16 | g__ | s__ | 38 | FALSE
k__Bacteria | p__Firmicutes | c__Bacilli | o__Lactobacillales | f__Lactobacillaceae | g__Lactobacillus | __ | 35 | FALSE
k__Bacteria | p__Firmicutes | c__Bacilli | o__Turicibacterales | f__Turicibacteraceae | g__Turicibacter | s__ | 35 | FALSE
k__Bacteria | p__Proteobacteria | c__Gammaproteobacteria | o__Pseudomonadales | f__Pseudomonadaceae | g__Pseudomonas | s__veronii | 25 | FALSE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | __ | __ | __ | 23 | FALSE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | f__[Mogibacteriaceae] | g__ | s__ | 18 | FALSE
k__Bacteria | p__Tenericutes | c__Mollicutes | o__RF39 | f__ | g__ | s__ | 17 | FALSE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | f__Lachnospiraceae | g__Dorea | s__ | 16 | FALSE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | f__Ruminococcaceae | g__Oscillospira | s__ | 14 | FALSE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | f__Lachnospiraceae | g__ | s__ | 14 | FALSE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | f__ | g__ | s__ | 14 | FALSE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | f__Ruminococcaceae | g__Butyricicoccus | s__pullicaecorum | 13 | FALSE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | f__Dehalobacteriaceae | g__Dehalobacterium | s__ | 13 | FALSE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | f__Lachnospiraceae | g__Clostridium | s__colinum | 12 | FALSE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | f__Ruminococcaceae | g__Clostridium | s__methylpentosum | 12 | FALSE
k__Bacteria | p__Firmicutes | c__Erysipelotrichi | o__Erysipelotrichales | f__Erysipelotrichaceae | g__Allobaculum | s__ | 12 | FALSE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | f__Ruminococcaceae | g__ | s__ | 11 | FALSE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | f__Ruminococcaceae | __ | __ | 11 | FALSE
k__Bacteria | p__Tenericutes | c__Mollicutes | o__Anaeroplasmatales | f__Anaeroplasmataceae | g__Anaeroplasma | s__ | 10 | FALSE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | f__Lachnospiraceae | __ | __ | 10 | FALSE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | f__Lachnospiraceae | g__Roseburia | s__ | 10 | FALSE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | f__Christensenellaceae | g__ | s__ | 10 | FALSE
k__Bacteria | p__[Thermi] | c__Deinococci | o__Deinococcales | f__Deinococcaceae | g__Deinococcus | s__ | 9 | FALSE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | f__Clostridiaceae | g__Clostridium | s__butyricum | 9 | FALSE
k__Bacteria | p__Firmicutes | c__Erysipelotrichi | o__Erysipelotrichales | f__Erysipelotrichaceae | g__Coprobacillus | s__ | 9 | FALSE
k__Bacteria | p__Firmicutes | c__Bacilli | o__Bacillales | f__Staphylococcaceae | g__Staphylococcus | __ | 9 | FALSE
k__Bacteria | p__Proteobacteria | c__Betaproteobacteria | o__Neisseriales | f__Neisseriaceae | g__Neisseria | s__cinerea | 9 | FALSE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | f__Lachnospiraceae | g__Blautia | __ | 9 | FALSE
k__Bacteria | p__Proteobacteria | c__Gammaproteobacteria | o__Pseudomonadales | f__Moraxellaceae | g__Acinetobacter | s__guillouiae | 9 | FALSE
k__Bacteria | p__Firmicutes | c__Bacilli | o__Bacillales | f__Staphylococcaceae | g__Jeotgalicoccus | s__psychrophilus | 9 | FALSE
k__Bacteria | p__Proteobacteria | c__Alphaproteobacteria | o__Rickettsiales | f__mitochondria | __ | __ | 9 | FALSE
k__Bacteria | p__Proteobacteria | c__Gammaproteobacteria | o__Enterobacteriales | f__Enterobacteriaceae | __ | __ | 9 | FALSE
k__Bacteria | p__Bacteroidetes | c__Bacteroidia | o__Bacteroidales | f__Bacteroidaceae | g__Bacteroides | s__ovatus | 9 | FALSE
k__Bacteria | p__Actinobacteria | c__Coriobacteriia | o__Coriobacteriales | f__Coriobacteriaceae | __ | __ | 8 | FALSE
k__Bacteria | p__Firmicutes | c__Bacilli | o__Lactobacillales | f__Streptococcaceae | g__Streptococcus | s__ | 8 | FALSE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | f__Clostridiaceae | g__Candidatus Arthromitus | s__ | 8 | FALSE
k__Bacteria | p__Firmicutes | c__Bacilli | o__Lactobacillales | f__Lactobacillaceae | g__Lactobacillus | s__reuteri | 7 | FALSE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | f__Lachnospiraceae | g__[Ruminococcus] | s__gnavus | 7 | FALSE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | f__Ruminococcaceae | g__Ruminococcus | s__ | 7 | FALSE
k__Bacteria | p__Firmicutes | c__Clostridia | o__Clostridiales | f__Lachnospiraceae | g__Coprococcus | s__ | 7 | FALSE
k__Bacteria | p__Actinobacteria | c__Coriobacteriia | o__Coriobacteriales | f__Coriobacteriaceae | g__ | s__ | 6 | FALSE
k__Bacteria | p__Cyanobacteria | c__Chloroplast | o__Streptophyta | f__ | g__ | s__ | 6 | FALSE
k__Bacteria | p__Actinobacteria | c__Coriobacteriia | o__Coriobacteriales | f__Coriobacteriaceae | g__Adlercreutzia | s__ | 5 | FALSE
k__Bacteria | p__Bacteroidetes | c__Bacteroidia | o__Bacteroidales | f__Rikenellaceae | g__ | s__ | 5 | FALSE
k__Bacteria | p__Bacteroidetes | c__Bacteroidia | o__Bacteroidales | f__S24-7 | g__ | s__ | 5 | FALSE

If we do not want to collapse the feature table, then the following commands are used to add the pseudocount and perform ANCOM.
# Produce the compositional feature table artifact
qiime composition add-pseudocount \
--i-table LateMouseGutTable3.qza \
--o-composition-table CompLateMouseGutTable.qza

Saved FeatureTable[Composition] to: CompLateMouseGutTable.qza

# Perform ANCOM to identify differential features
qiime composition ancom \
--i-table CompLateMouseGutTable.qza \
--m-metadata-file SampleMetadataMiSeq_SOP.tsv \
--m-metadata-column Sex \
--o-visualization AncomSexMisSeq_SOP.qzv \
--verbose

Saved Visualization to: AncomSexMisSeq_SOP.qzv

ANCOM provides two optional parameters, --p-transform-function and --p-difference-function, for identifying differential features. Both parameters take text choices: ('sqrt', 'log', 'clr') for --p-transform-function and ('mean_difference', 'f_statistic') for --p-difference-function. The transform-function specifies the method used to transform feature values before generating volcano plots. The default is "clr" (centered log-ratio transformation); the other two are the square root ("sqrt") and log ("log") transformations. The difference-function specifies the method used to visualize the fold difference in feature abundances across groups in the volcano plots. An example command is provided below.

# Perform ANCOM to identify differential features
qiime composition ancom \
--i-table CompLateMouseGutTableL7.qza \
--m-metadata-file SampleMetadataMiSeq_SOP.tsv \
--m-metadata-column Sex \
--p-transform-function clr \
--p-difference-function mean_difference \
--o-visualization AncomSexMisSeq_SOPL7.qzv \
--verbose

Saved Visualization to: AncomSexMisSeq_SOPL7.qzv

14.3.3 Remarks on ANCOM

  • It has been shown that ANCOM controls the FDR well while maintaining power comparable to other methods (Mandal et al. 2015; Lin and Peddada 2020a). However, like ALDEx2, ANCOM in effect uses log-ratio transformations as a kind of normalization (log-ratio "normalizations"). Thus, ANCOM suffers from some of the same limitations as ALDEx2:
    • Its usefulness mainly depends on interpreting the log-ratio transformation as a normalization. In other words, the statistical tests can be appropriately interpreted only when the log-ratio transformation sufficiently approximates an unchanged reference (Quinn et al. 2018a, b). It has been argued that methods which do not rely on log-ratio transformations as a kind of normalization are more appropriate (Quinn et al. 2018a, b).

    • It cannot address the zero-inflation problem; instead, it adds an arbitrary small pseudocount value such as 0.001, 0.5, or 1 to all read counts. This indicates that the compositional analysis approach fails in the presence of zero values (Xia et al. 2018a, p. 389). It has been shown that, due to improper handling of zero counts, ANCOM inflated the number of false positives rather than controlling the false discovery rate (FDR) (Brill et al. 2019).

    • It is underpowered and requires a large number of samples because of its non-parametric testing (Quinn et al. 2018a, b), and it has decreased sensitivity on small datasets (e.g., fewer than 20 samples per group), partially because of its non-parametric nature (i.e., the Mann-Whitney test) (Weiss et al. 2017; Quinn et al. 2018a, b).

  • Specifically, ANCOM has other weaknesses, including:
    • ANCOM takes the alr-transformation approach and transforms the observed abundance of each taxon into log-ratios of the observed abundance relative to a pre-specified reference taxon. When performing differential abundance analysis, the compared samples typically contain thousands of taxa, so it is computationally intensive to repeatedly apply the alr-transformation with each taxon in the dataset as the reference. The choice of reference taxon is also a challenge when the number of taxa is large.

    • Regarding whether ANCOM can control the FDR well, different studies have reported inconsistent results: some simulation studies reported that ANCOM controls the FDR reasonably well under various scenarios (Lin and Peddada 2020a; Weiss et al. 2017), whereas others reported that ANCOM could generate false-positive results when a cutoff of 0.6 was used for the W statistic (Morton et al. 2019). Thus, when ANCOM is used for differential abundance analysis, a stricter cutoff value for the W statistic is recommended to reduce the chance of false positives.

    • ANCOM uses the quantile of its test statistic W instead of P-values to assess statistical significance. This not only makes the analysis results difficult to interpret (Lin and Peddada 2020b) but also prevents ANCOM from improving its performance by filtering taxa before analysis. ANCOM has been reported to reduce the number of detected differentially abundant taxa, which is most likely related to the way the W statistics are calculated and used for significance (Wallen 2021).

    • ANCOM does not provide a P-value for each individual taxon and cannot provide standard errors or confidence intervals of DA for each taxon (Lin and Peddada 2020a).

    • The differential abundance testing results are difficult to interpret because ANCOM uses presumed invariant features to guide the log-ratio transformation (Quinn et al. 2018a, b).

14.4 Analysis of Composition of Microbiomes-Bias Correction (ANCOM-BC)

ANCOM-BC is a bias-corrected version of ANCOM (analysis of compositions of microbiomes).

14.4.1 Introduction to ANCOM-BC

ANCOM-BC (Lin and Peddada 2020a) was proposed for differential abundance (DA) analysis of microbiome data, adding bias correction to ANCOM. ANCOM-BC was developed under the assumptions that (1) the observed abundance in a feature table is expected to be proportional to the unobservable absolute abundance of a taxon in a unit volume of the ecosystem, and (2) the sampling fraction varies from sample to sample and hence induces estimation bias. Thus, to address the problem of unequal sampling fractions, ANCOM-BC uses a sample-specific offset term that serves as the bias correction (Lin and Peddada 2020a). An offset term is commonly used in generalized linear models, such as Poisson and zero-inflated models, to adjust for the sampled population (Xia et al. 2018b). Here, ANCOM-BC uses a sample-specific offset term in a linear regression framework.

We can summarize the methods of ANCOM-BC into the following six parts.
  • Part 1: Define the sample-specific sampling fraction.

Unlike some other DA studies, which define the relative abundances of taxa as the frequencies of these taxa in a sample, in ANCOM-BC the relative abundance of a taxon in a sample refers to the fraction of that taxon observed in the feature table relative to the sum of all observed taxa for that sample (Lin and Peddada 2020a, b). That is, it is the proportion of this taxon relative to the sum of all taxa in the sample, with range (0, 1). In ANCOM-BC, absolute abundance is defined as the unobservable actual abundance of a taxon in a unit volume of an ecosystem, while observed abundance refers to the observed counts of features (OTUs or ASVs) in the feature table (Lin and Peddada 2020b).

Let Oij denote the observed abundance of the ith taxon in the jth sample and Aij the unobserved absolute abundance of the ith taxon in the ecosystem of the jth sample; then the sample-specific sampling fraction cj is defined as:

$$ {c}_j=\frac{E\left({O}_{ij}|{A}_{ij}\right)}{A_{ij}}, $$
(14.11)
where cj is the ratio of the expected observed abundance of taxon i in the jth sample to its absolute abundance in a unit volume of the ecosystem (e.g., the gut) from which the sample was derived; it can be empirically estimated by the ratio of the library size to the microbial load.
As we described in Sect. 14.1.2, both the library size and the microbial load can drive microbiome data to be compositional. Thus, ANCOM-BC not only normalizes for the effective library size across samples but also takes differences in the microbial loads into account. Lin and Peddada (2020a) argued that data normalized using the sampling fractions are better than data obtained from normalization methods that rely purely on library sizes. A normalization method that accounts for differences in sampling fractions can avoid the bias in differential abundance analysis that otherwise increases the false-positive and false-negative rates.
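The role of the sampling fraction can be shown with a minimal numeric sketch (illustrative Python with hypothetical numbers; the sampling fractions are treated as known, whereas in practice they must be estimated): unequal sampling fractions alone create an apparent effect on the log scale, and an offset term removes it.

```python
import math

# A taxon with identical absolute abundance in two ecosystems still shows an
# apparent log-scale "difference" equal to the difference in log sampling
# fractions; subtracting the offset log(c_j) removes it.
A = 1000.0                        # true absolute abundance in both ecosystems
c = {"A": 1e-3, "B": 4e-3}        # sample-specific sampling fractions
obs = {g: c[g] * A for g in c}    # expected observed abundance, Eq. (14.11)

naive_diff = math.log(obs["B"]) - math.log(obs["A"])
corrected = (math.log(obs["B"]) - math.log(c["B"])) \
          - (math.log(obs["A"]) - math.log(c["A"]))
print(round(naive_diff, 4))       # log(4) ~ 1.3863: spurious "effect"
print(round(corrected, 4))        # 0.0: the offset removes the bias
```

In practice the cj are unknown; estimating an analogous offset from the data is exactly the bias-correction problem that ANCOM-BC addresses in the parts that follow.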
  • Part 2: Describe two model assumptions of ANCOM-BC.

Assumption 1 is given below.

$$ {\displaystyle \begin{array}{l}E\left({O}_{ijk}|{A}_{ijk}\right)={c}_{jk}{A}_{ijk}\\ {}\mathrm{Var}\left({O}_{ijk}|{A}_{ijk}\right)={\sigma}_{w, ijk}^2,\end{array}} $$
(14.12)
where $$ {\sigma}_{w, ijk}^2 $$ is the variability between specimens within the kth sample from the jth group, which characterizes the within-sample variability. Usually the within-sample variability is not estimated, since in most microbiome studies only one specimen is typically available at a given time. The assumption in (14.12) states that the observed abundance of a taxon in a random sample is expected to be in constant proportion to its absolute abundance in the ecosystem of the sample. That is, the expected relative abundance of each taxon in a random sample equals the relative abundance of the taxon in the ecosystem of the sample (Lin and Peddada 2020a).
Assumption 2 is given below. For each taxon i, Aijk, j = 1, …, g, k = 1, …, nj, are independently distributed with
$$ {\displaystyle \begin{array}{l}E\left({A}_{ij k}|{\theta}_{ij}\right)={\theta}_{ij}\\ {}\mathrm{Var}\left({A}_{ij k}|{\theta}_{ij}\right)={\sigma}_{b, ij}^2,\end{array}} $$
(14.13)
where θij is a fixed parameter rather than a random variable; $$ {\sigma}_{b, ij}^2 $$ is the between-sample variation within group j for the ith taxon. The assumption in (14.13) states that for a given taxon, all subjects within and between groups are independent.

Combining Assumption 1 in (14.12) and Assumption 2 in (14.13), the expectation and variance of Oijk are defined as:

$$ {\displaystyle \begin{array}{l}E\left({O}_{ijk}\right)={c}_{jk}{\theta}_{ij}\\ {}\mathrm{Var}\left({O}_{ijk}\right)=f\left({\sigma}_{w, ijk}^2,{\sigma}_{b, ij}^2\right):= {\sigma}_{t, ijk}^2.\end{array}} $$
(14.14)
  • Part 3: Introduce a linear regression framework for log-transformed OTU count data that includes the sample-specific bias due to sampling fractions.

Under the above setting, the linear model framework for the log-transformed OTU count data is written as below.

$$ {y}_{ij k}={d}_{jk}+{\mu}_{ij}+{\varepsilon}_{ij k}, $$
(14.15)
where
$$ {\displaystyle \begin{array}{c}E\left({\varepsilon}_{ij k}\right)=0,\\ {}E\left({y}_{ij k}\right)={d}_{jk}+{\mu}_{ij},\\ {}\mathrm{Var}\left({y}_{ij k}\right)=\mathrm{Var}\left({\varepsilon}_{ij k}\right):= {\sigma}_{ij k}^2.\end{array}} $$
(14.16)
Note that d in the above equation simply reflects the effect of c; it is not exactly log(c) due to Jensenʼs inequality. The above model formulation is distinct from the standard one-way ANOVA: it shows that the sample-specific bias due to sampling fractions is introduced into the model. For details, the readers are referred to the original publication (Lin and Peddada 2020a).
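The remark that d is not exactly log(c) follows from Jensen's inequality, E[log C] ≤ log E[C]; a quick Monte Carlo check makes the gap concrete (generic Python, with a lognormal variable as a hypothetical stand-in for the sampling fraction):

```python
import math, random

# E[log C] vs. log E[C] for a lognormal(0, 1) variable C: the two differ
# by Jensen's inequality, so d only reflects c rather than equaling log(c).
random.seed(0)
draws = [random.lognormvariate(0.0, 1.0) for _ in range(200_000)]
mean_log = sum(math.log(x) for x in draws) / len(draws)   # ~ E[log C] = 0
log_mean = math.log(sum(draws) / len(draws))              # ~ log E[C] = 0.5
print(round(mean_log, 2), round(log_mean, 2))
```

For a lognormal(0, 1) variable, E[log C] = 0 while log E[C] = 0.5, so the two quantities differ even in expectation.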
  • Part 4: Develop a linear regression model that estimates the sample-specific bias and ensures that the estimator and the test statistic are asymptotically centered at zero under the null hypothesis.

Because the sample-specific bias arises from the differing sampling fraction of each sample, the goal of ANCOM-BC is to eliminate this bias. Given the large number of taxa measured on each subject, ANCOM-BC borrows information across taxa to estimate the bias.

The least squares framework is used to estimate the bias and its variance under the null hypothesis, as follows.

$$ {\displaystyle \begin{array}{c}{\hat{d}}_{jk}={\overline{y}}_{. jk}-{\overline{y}}_{.j},k=1,\dots, {n}_j,j=1,2,\dots, g,\\ {}{\hat{\mu}}_{ij}={\overline{y}}_{ij.}-{\overline{\hat{d}}}_{j.}={\overline{y}}_{ij.},i=1,\cdots, m.\end{array}} $$
(14.17)
where $$ E\left({\hat{\mu}}_{ij}\right)=E\left({\overline{y}}_{ij.}\right)={\mu}_{ij}+{\overline{d}}_{j.} $$. Thus, for each j = 1, 2, …, g, Lin and Peddada (2020a) prove that $$ {\hat{\mu}}_{ij} $$ is a biased estimator and $$ E\left({\hat{\mu}}_{i1}-{\hat{\mu}}_{i2}\right)=\left({\mu}_{i1}-{\mu}_{i2}\right)+\left({\overline{d}}_{1.}-{\overline{d}}_{2.}\right). $$ For two experimental groups with a balanced design (i.e., g = 2 and n1 = n2 = n) and given two ecosystems, for each taxon i, i = 1, ⋯, m, the test hypotheses are $$ {\displaystyle \begin{array}{l}{H}_0:{\mu}_{i1}={\mu}_{i2},\\ {}{H}_a:{\mu}_{i1}\ne {\mu}_{i2}.\end{array}} $$

By denoting $$ \delta =\left({\overline{d}}_{1.}-{\overline{d}}_{2.}\right) $$, under the null hypothesis, $$ E\left({\hat{\mu}}_{i1}-{\hat{\mu}}_{i2}\right)=\delta \ne 0 $$, and hence the estimator is biased. From (14.15) and the Lyapunov central limit theorem, Lin and Peddada (2020a) show that

$$ \frac{{\hat{\mu}}_{ij}-\left({\mu}_{ij}+{\overline{d}}_{j.}\right)}{\sigma_{ij}}{\to}_dN\left(0,1\right)\kern0.5em \mathrm{as}\kern0.5em n\to \infty, $$
(14.18)
where $$ {\sigma}_{ij}^2=\mathrm{Var}\left({\hat{\mu}}_{ij}\right)=\mathrm{Var}\left({\overline{y}}_{ij.}\right)=\frac{1}{n^2}\sum \limits_{k=1}^n{\sigma}_{ij k}^2. $$ Lin and Peddada (2020a) show that the taxa can be modeled using a Gaussian mixture model, with δ estimated via the expectation-maximization (EM) algorithm (the resulting estimator of δ is denoted $$ {\hat{\delta}}_{\mathrm{EM}} $$). They also show that $$ {\hat{\delta}}_{\mathrm{EM}} $$ and $$ {\hat{\delta}}_{\mathrm{WLS}} $$ (the weighted least squares (WLS) estimator of δ) are highly correlated and approximately unbiased, so $$ {\hat{\delta}}_{\mathrm{WLS}} $$ can be used to approximate $$ {\hat{\delta}}_{\mathrm{EM}} $$.

Therefore, under some further assumptions and developments, for the two-group hypothesis test $$ {\displaystyle \begin{array}{l}{H}_0:{\mu}_{i1}={\mu}_{i2}\\ {}{H}_a:{\mu}_{i1}\ne {\mu}_{i2},\end{array}} $$ for taxon i, the following test statistic is approximately centered at zero under the null hypothesis:

$$ {W}_i=\frac{{\hat{\mu}}_{i1}-{\hat{\mu}}_{i2}-{\hat{\delta}}_{\mathrm{EM}}}{\sqrt{\sigma_{i1}^2+{\sigma}_{i2}^2}}. $$
(14.19)

From Slutsky’s theorem,

$$ {W}_i{\to}_dN\left(0,1\right)\kern0.24em \mathrm{as}\kern0.5em m,n\to \infty . $$
(14.20)

The above test statistic is modified as follows if the sample size is not very large and/or the number of non-null taxa is very large:

$$ {W}_i^{\ast }=\frac{{\hat{\mu}}_{i1}-{\hat{\mu}}_{i2}-{\hat{\delta}}_{\mathrm{WLS}}}{\sqrt{{\hat{\sigma}}_{i1}^2+{\hat{\sigma}}_{i2}^2+\hat{\mathrm{V}}\mathrm{ar}\left({\hat{\delta}}_{\mathrm{WLS}}\right)+2\sqrt{\left({\hat{\sigma}}_{i1}^2+{\hat{\sigma}}_{i2}^2\right)\hat{\mathrm{V}}\mathrm{ar}\left({\hat{\delta}}_{\mathrm{WLS}}\right)}}}. $$
(14.21)
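A minimal sketch (our own, in Python; the function name and the numeric inputs are hypothetical) of computing the bias-corrected statistic of Eq. (14.19): the estimated group difference is corrected by the estimated bias and scaled by its standard error.

```python
import math

def ancom_bc_w(mu1_hat, mu2_hat, var1, var2, delta_hat):
    """W_i of Eq. (14.19): bias-corrected difference in estimated mean
    log abundances, divided by sqrt of the summed variances."""
    return (mu1_hat - mu2_hat - delta_hat) / math.sqrt(var1 + var2)

# If the observed difference equals the estimated bias, W is 0 (no signal):
print(ancom_bc_w(1.5, 1.0, 0.01, 0.01, 0.5))  # 0.0
```

Under the null, this statistic is approximately N(0, 1), which is what justifies the two-sided Z-test used downstream.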
  • Part 5: Adjust the P-values from the taxon-wise tests.

To control the FDR in multiple comparisons, ANCOM-BC recommends using the Holm-Bonferroni method (Holm 1979) or the Bonferroni correction (Dunn 1958, 1961) instead of the Benjamini-Hochberg (BH) procedure (Benjamini and Hochberg 1995) to adjust the raw P-values. This recommendation reflects findings in the literature that the Holm-Bonferroni and Bonferroni methods are more appropriate for controlling the FDR when P-values are not accurate (Lim et al. 2013), and that the BH procedure controls the FDR only under independence or certain special correlation structures (e.g., positive regression dependence among taxa) (Benjamini and Hochberg 1995; Benjamini and Yekutieli 2001). Compared to the Bonferroni correction, ANCOM-BC results adjusted by the BH procedure have larger power.
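To make the two adjustment rules concrete, here is a small self-contained Python sketch (our own illustration, not the implementation behind R's p.adjust()) of the Holm step-down and BH step-up adjustments:

```python
import numpy as np

def holm(p):
    """Holm step-down adjustment: ascending p, running max of (m-k+1)*p, capped at 1."""
    p = np.asarray(p, float)
    m = len(p)
    adj = np.empty(m)
    running = 0.0
    for rank, idx in enumerate(np.argsort(p)):
        running = max(running, (m - rank) * p[idx])
        adj[idx] = min(1.0, running)
    return adj

def bh(p):
    """Benjamini-Hochberg step-up adjustment: descending p, running min of (m/k)*p."""
    p = np.asarray(p, float)
    m = len(p)
    adj = np.empty(m)
    running = 1.0
    for rank, idx in enumerate(np.argsort(p)[::-1]):  # largest p first
        running = min(running, m / (m - rank) * p[idx])
        adj[idx] = running
    return adj

print(holm([0.01, 0.04, 0.03, 0.005]))  # [0.03 0.06 0.06 0.02]
print(bh([0.01, 0.04, 0.03, 0.005]))    # [0.02 0.04 0.04 0.02]
```

Note that the Holm-adjusted values are never smaller than the BH-adjusted ones, which is the conservatism-versus-power trade-off discussed above.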
  • Part 6: Develop the test statistic for hypothesis testing for multigroup comparison to conduct the global test.

The test statistic for multigroup comparison is developed in the same way as for the two-group comparison. That is, first obtain the initial estimates of $$ {\hat{\mu}}_{ij} $$ and $$ {\hat{d}}_{jk} $$; then set the reference group r (e.g., r = 1) and obtain the estimator of the bias term $$ {\hat{\delta}}_{rj} $$ through the E-M algorithm. The final estimator of the mean absolute abundance of the ecosystem (in log scale) is obtained by transforming $$ {\hat{\mu}}_{ij} $$ of (14.17) into:

$$ {\hat{\mu}}_{ij}^{\ast}:= \left\{\begin{array}{cc}{\hat{\mu}}_{ir},&amp; j=r\\ {}{\hat{\mu}}_{ij}+{\hat{\delta}}_{rj},&amp; j\ne r\in 1,\dots, g\end{array}\right.. $$
(14.22)

The test statistic for pairwise comparison is defined as:

$$ {W}_{i,j{j}^{\prime }}=\frac{{\hat{\mu}}_{ij}^{\ast }-{\hat{\mu}}_{i{j}^{\prime}}^{\ast }}{\sqrt{{\hat{\sigma}}_{ij}^2+{\hat{\sigma}}_{i{j}^{\prime}}^2}},i=1,\dots, m,j\ne {j}^{\prime}\in \left\{1,\dots, g\right\}. $$
(14.23)

For computational simplicity, the global test statistic is reformulated based on William’s type of test (Williams 1971, 1977; Peddada et al. 2001; Farnan et al. 2014) as follows:

$$ {W}_i=\underset{j\ne j\hbox{'}\in \left\{1,\dots, g\right\}}{\max}\mid {W}_{i, jj\hbox{'}}\mid, i=1,\dots, m. $$
(14.24)

Under the null, $$ {W}_{i,j{j}^{\prime }}{\to}_dN\left(0,1\right) $$; thus, for each specific taxon i, the null distribution of Wi can be constructed by simulation. Finally, the P-value is calculated as:

$$ {p}_i=\frac{1}{B}\sum \limits_{b=1}^BI\left({W}_i^{(b)}&gt;{W}_i\right),i=1,\dots, m, $$
(14.25)
where B is the number of simulations (e.g., B = 1000), and the Bonferroni correction is applied to control the FDR.
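The simulation-based P-value of Eqs. (14.24) and (14.25) can be sketched as follows (a simplified Python illustration of our own; a faithful implementation would draw the pairwise statistics with their actual correlation structure, which we ignore here):

```python
import numpy as np

def global_test_pvalue(w_pairwise, B=1000, seed=0):
    """Empirical P-value of Eq. (14.25): compare the observed
    W_i = max_{j != j'} |W_{i,jj'}| against B simulated max-|N(0,1)| draws."""
    rng = np.random.default_rng(seed)
    n_pairs = len(w_pairwise)
    w_obs = np.max(np.abs(w_pairwise))             # Eq. (14.24)
    null = np.max(np.abs(rng.standard_normal((B, n_pairs))), axis=1)
    return float(np.mean(null > w_obs))            # Eq. (14.25)

print(global_test_pvalue([8.0, -7.5, 6.9]))   # extreme statistics -> tiny P-value
print(global_test_pvalue([0.05, 0.1, -0.02])) # near-zero statistics -> large P-value
```

A large maximum pairwise statistic yields a small empirical P-value; the Bonferroni correction is then applied across taxa as stated above.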

14.4.2 Implement ANCOM-BC Using the ANCOMBC Package

The ANCOM-BC method is implemented via the ancombc() function in the ANCOMBC package. This package was developed to identify taxa whose absolute abundances per unit volume in the ecosystem (e.g., gut, mouth, or vagina) are differentially abundant between treatment groups/conditions, while allowing adjustment for other covariates of interest. In the current version (ANCOM-BC v.1.2.2, October 2021), ANCOM-BC is performed for cross-sectional data.

One syntax is given as:

ancombc(phyloseq, formula, p_adj_method = "holm", zero_cut = 0.9, lib_cut = 1000, group = NULL, struc_zero = FALSE, neg_lb = FALSE, tol = 1e-05, max_iter = 100, conserve = FALSE, alpha = 0.05, global = TRUE)
where the argument:
  • phyloseq is a phyloseq-class object, consisting of a feature (OTU/ASV) table (in ANCOM-BC, this refers to the microbial observed abundance table), sample metadata, a taxonomy table (optional), and a phylogenetic tree (optional). As we described in Chap. 2, to create a phyloseq-class object, the row names of the metadata must match the sample names of the feature table, and the row names of the taxonomy table must match the taxon (feature) names of the feature table.

  • formula is a character string expression. It specifies how the microbial absolute abundances for each taxon (response variable) depend on the variables in the metadata (predictor variables or covariates); for example, formula = "Age + Gender + BMI + Group".

  • p_adj_method is used to specify the method for adjusting P-values. The default method is "holm" (Holm 1979). The optional methods include "holm", "hochberg" (Hochberg 1988), "hommel" (Hommel 1988), "bonferroni" (Bonferroni 1936), "BH" or its alias "fdr" (Benjamini and Hochberg 1995), "BY" (Benjamini and Yekutieli 2001), and "none". Tukey's method is not available in this package. Type ?p.adjust in R to check the options and references. If you do not want to adjust the P-values, use the pass-through option ("none").

  • zero_cut is a numerical fraction in (0, 1). Taxa with a proportion of zero values greater than this cutoff will be excluded from the analysis. Default is 0.90.

  • lib_cut is used to specify a numerical threshold for filtering samples based on library sizes. Samples with library sizes less than this cutoff value will be excluded from the analysis. Default is 0, i.e., no sample is filtered.

  • group is the name of the group variable in the metadata. It is required for detecting structural zeros and performing the global test.

  • struc_zero is used to specify whether to detect structural zeros. Default is FALSE.

  • neg_lb is used to specify whether to classify a taxon as a structural zero in the corresponding study group using its asymptotic lower bound. Default is FALSE.

  • tol is the iteration convergence tolerance for the E-M algorithm. Default is 1e-05.

  • max_iter is used to specify the maximum number of iterations for the E-M algorithm. Default is 100.

  • conserve is used to specify whether to use a conservative variance estimate of the test statistic. Default is FALSE. Because ANCOM-BC may not perform well when sample sizes are very small (e.g., n ≤ 5 per group) or when the proportion of differentially abundant taxa is very large (e.g., >75%) (Lin and Peddada 2020a), setting conserve = TRUE is recommended in the case of small sample sizes and/or a large number of differentially abundant taxa.

  • alpha is used to specify the level of significance. Default is 0.05.

  • global is used to specify whether to perform the global test. Default is FALSE.

Example 14.4: Breast Cancer QTRT1 Mouse Gut Microbiome, Example 14.2, Cont.
In this dataset, the mouse fecal samples were collected for two genotype groups (KO and WT) at two time points: pretreatment and posttreatment. We first use the ANCOMBC primary analysis to obtain the primary results of genotype, time, and their interaction effects. Then we use the ANCOMBC global test to obtain the global test results of the genotype and time interaction.
  • First, type the following commands to download, install, and load the ANCOMBC package.

> if (!requireNamespace("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
> BiocManager::install("ANCOMBC")
> library(ANCOMBC)
  • Then, load OTU-table, taxonomy table, and sample metadata.

> setwd("~/Documents/QIIME2R/Ch14_Compositional/ANCOMBC")
> otu<-read.csv("otu_table_L7_MCF7_phyloseq.csv",row.names = 1)
> tax<-read.csv("tax_table_L7_MCF7_phyloseq.csv",row.names = 1)
> sam<-read.csv("metadata_QtRNA.csv",row.names = 1)
  • Next, create a phyloseq-class object.

> otumat<-as.matrix(otu)
> taxmat<-as.matrix(tax)
> class(otumat)
[1] "matrix" "array"
> class(taxmat)
[1] "matrix" "array"
> class(sam)
[1] "data.frame"
> library("phyloseq")
> # Merge otu table, taxa table and meta table to create a phyloseq object
> otu_tab = otu_table(otumat, taxa_are_rows = TRUE)
> tax_tab = tax_table(taxmat)
> meta_tab = sample_data(sam)
> physeq = phyloseq(otu_tab, tax_tab, meta_tab)
> physeq
phyloseq-class experiment-level object
otu_table() OTU Table: [ 635 taxa and 40 samples ]
sample_data() Sample Data: [ 40 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 635 taxa by 7 taxonomic ranks ]
  • Finally, implement the ancombc() function to perform primary analysis and global test.

We can specify global = TRUE to request both global test and primary analysis.

> library(ANCOMBC)
> output = ancombc(phyloseq = physeq, formula = "Group*Time",
+ p_adj_method = "holm", zero_cut = 0.90, lib_cut = 1000,
+ group = "Genotype",struc_zero = TRUE, neg_lb = TRUE, tol = 1e-5,
+ max_iter = 100, conserve = TRUE, alpha = 0.05, global = TRUE)
> rest_primary = output$res
> rest_global = output$res_global
  • ANCOMBC Primary Analysis.

The following commands request only the primary analysis.

> library(ANCOMBC)
> output = ancombc(phyloseq = physeq, formula = "Group*Time",
+ p_adj_method = "holm", zero_cut = 0.90, lib_cut = 1000,
+ group = "Genotype",struc_zero = TRUE, neg_lb = TRUE, tol = 1e-5,
+ max_iter = 100, conserve = TRUE, alpha = 0.05, global = FALSE)
The ancombc() function uses raw counts as input data. We specify p_adj_method to be "holm", which is the default. This call returns the following six primary results from the ANCOM-BC log-linear model, which are used to determine taxa that are differentially abundant based on the specified covariates:
  1. res$beta (coefficients: a data.frame of coefficients).

  2. res$se (standard errors: a data.frame of standard errors (SEs) of beta).

  3. res$W (test statistics: a data.frame of test statistics, W = beta/se).

  4. res$p_val (P-values: a data.frame of P-values obtained from the two-sided Z-test using the test statistic W).

  5. res$q_val (adjusted P-values: a data.frame of adjusted P-values obtained by applying the p_adj_method to p_val).

  6. res$diff_abn (differentially abundant taxon indicators: a logical data.frame; TRUE if the taxon has q_val less than alpha, otherwise FALSE).
Other returned results include:
  • res$feature_table: a data.frame of the pre-processed (based on zero_cut and lib_cut) microbial observed abundance table.

  • res$zero_ind: a logical matrix with TRUE indicating the taxon is identified as a structural zero for the specified group variable.

  • res$samp_frac: a numeric vector of estimated sampling fractions in log scale (natural log).

  • res$resid: a matrix of residuals from the ANCOM-BC log-linear (natural log) model. Rows are taxa and columns are samples.

  • res$delta_em: estimated bias terms through E-M algorithm.

  • res$delta_wls: estimated bias terms through weighted least squares (WLS) algorithm.

Microbiome data may contain three kinds of zeros: structural zeros, sampling zeros, and rounded zeros. Their definitions can be found in Xia et al. (2018c), and the strategies for dealing with them in compositional microbiome data have been described in Xia et al. (2018a).

Briefly, structural zeros refer to zero counts in experimental groups where the taxon is not expected to be present. ANCOM-BC uses the following two criteria defined in Kaul et al. (2017) to detect structural zeros. Let pij represent the proportion of non-zero samples for the ith taxon in the jth group, and let $$ {\hat{p}}_{ij}=\frac{1}{n_j}{\sum}_{k=1}^{n_j}I\left({O}_{ij k}\ne 0\right) $$ denote the estimate of pij. Then the ith taxon is declared a structural zero in the jth group if either of the following holds:
  1. $$ {\hat{p}}_{ij}=0 $$.

  2. $$ {\hat{p}}_{ij}-1.96\sqrt{\frac{{\hat{p}}_{ij}\left(1-{\hat{p}}_{ij}\right)}{n_j}}\le 0 $$.

If a taxon is declared to be a structural zero in an experimental group, then, for that specific ecosystem, ANCOM-BC does not include this taxon in the analysis (Lin and Peddada 2020a). To apply both criteria in the above definition of structural zeros, set neg_lb = TRUE; otherwise, ANCOM-BC uses only criterion 1 to declare structural zeros. Setting neg_lb = TRUE is recommended when the sample size per group is relatively large (e.g., >30).
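The two criteria can be sketched in a few lines of Python (our own illustration; the function `structural_zero` and its input layout are hypothetical, not the package's internals):

```python
import numpy as np

def structural_zero(counts, neg_lb=True):
    """Flag taxa as structural zeros within one group.
    counts: (m taxa, n samples) array of observed counts for that group."""
    counts = np.asarray(counts)
    n = counts.shape[1]
    p_hat = (counts != 0).mean(axis=1)            # estimated non-zero proportion
    crit1 = p_hat == 0                            # criterion 1: never observed
    if not neg_lb:
        return crit1
    lb = p_hat - 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)
    return crit1 | (lb <= 0)                      # criterion 2: lower bound <= 0

counts = np.array([[0, 0, 0, 0],   # absent everywhere -> structural zero
                   [1, 0, 0, 0],   # rare: flagged only with neg_lb=True
                   [5, 3, 2, 1]])  # always present
print(structural_zero(counts))                 # [ True  True False]
print(structural_zero(counts, neg_lb=False))   # [ True False False]
```

The second row illustrates why neg_lb matters: a taxon seen in only 1 of 4 samples has a confidence lower bound below zero and is flagged under criterion 2, but not under criterion 1.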

We create the object "rest_primary" to hold the primary results (output$res) for further operations.

> rest_primary = output$res

We use the following seven steps to obtain the ANCOMBC primary results, in which genotype, time, and their interaction effects are analyzed by the ancombc() function (Table 14.4).
  • Step 1: Obtain the estimated coefficients of beta.

Table 14.4

The estimated coefficients of beta in QtRNA microbiome data

OTU | Kingdom | Phylum | Class | Order | Family | Genus | Species | Group | TimePost | Group:TimePost
OTU_56 | D_0__Bacteria | D_1__Actinobacteria | D_2__Coriobacteriia | D_3__Coriobacteriales | D_4__Eggerthellaceae | D_5__Enterorhabdus | D_6__Enterorhabdus caecimuris B7 | 0.000463 | 0.049312 | 0.012773
OTU_57 | D_0__Bacteria | D_1__Actinobacteria | D_2__Coriobacteriia | D_3__Coriobacteriales | D_4__Eggerthellaceae | D_5__Enterorhabdus | D_6__Enterorhabdus muris | 0.856548 | 0.604616 | −0.83636
OTU_58 | D_0__Bacteria | D_1__Actinobacteria | D_2__Coriobacteriia | D_3__Coriobacteriales | D_4__Eggerthellaceae | D_5__Enterorhabdus | D_6__mouse gut metagenome | −0.35507 | 1.136966 | −0.0586
OTU_77 | D_0__Bacteria | D_1__Bacteroidetes | D_2__Bacteroidia | D_3__Bacteroidales | D_4__Bacteroidaceae | D_5__Bacteroides | D_6__Bacteroides acidifaciens JCM 10556 | −0.06885 | 0.389432 | −0.32735
OTU_101 | D_0__Bacteria | D_1__Bacteroidetes | D_2__Bacteroidia | D_3__Bacteroidales | D_4__Muribaculaceae | D_5__Muribaculum | D_6__Parabacteroides sp. YL27 | 0.000463 | 0.118627 | 3.06081
OTU_131 | D_0__Bacteria | D_1__Bacteroidetes | D_2__Bacteroidia | D_3__Bacteroidales | D_4__Tannerellaceae | D_5__Parabacteroides | D_6__Parabacteroides goldsteinii CL02T12C30 | −0.8368 | −1.42599 | −1.49273
OTU_221 | D_0__Bacteria | D_1__Cyanobacteria | D_2__Oxyphotobacteria | D_3__Nostocales | D_4__Nostocaceae | Other | Other | 0.375805 | 0.326571 | −0.5012
OTU_263 | D_0__Bacteria | D_1__Firmicutes | D_2__Bacilli | D_3__Bacillales | D_4__Staphylococcaceae | D_5__Staphylococcus | D_6__Staphylococcus saprophyticus subsp. saprophyticus | −0.13817 | −0.47644 | 0.220717
OTU_264 | D_0__Bacteria | D_1__Firmicutes | D_2__Bacilli | D_3__Bacillales | D_4__Staphylococcaceae | D_5__Staphylococcus | D_6__Staphylococcus sp. UAsDu23 | −0.93095 | −1.7192 | 1.013497
OTU_276 | D_0__Bacteria | D_1__Firmicutes | D_2__Bacilli | D_3__Lactobacillales | D_4__Lactobacillaceae | D_5__Lactobacillus | D_6__Lactobacillus murinus | 0.139093 | 0.257256 | −0.3338
OTU_279 | D_0__Bacteria | D_1__Firmicutes | D_2__Bacilli | D_3__Lactobacillales | D_4__Lactobacillaceae | D_5__Lactobacillus | D_6__gut metagenome | 0.103425 | 0.106848 | −0.3338
OTU_287 | D_0__Bacteria | D_1__Firmicutes | D_2__Bacilli | D_3__Lactobacillales | D_4__Streptococcaceae | D_5__Streptococcus | D_6__Streptococcus danieliae | −0.25306 | 0.840292 | 0.151861
OTU_323 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__ASF356 | D_6__Clostridium sp. ASF356 | 0.000463 | −0.02 | 3.469499
OTU_325 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Acetatifactor | D_6__Clostridiales bacterium CIEAF 015 | 0.156511 | −2.28364 | −0.26737
OTU_330 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Anaerostipes | D_6__Clostridiales bacterium VE202-09 | −0.11611 | −0.50183 | −0.3425
OTU_351 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Lachnospiraceae NK4A136 group | D_6__Clostridiales bacterium CIEAF 020 | 0.748644 | 0.79693 | −1.77908
OTU_353 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Lachnospiraceae NK4A136 group | D_6__Lachnospiraceae bacterium A4 | −0.0377 | −0.42619 | −0.44784
OTU_354 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Lachnospiraceae NK4A136 group | D_6__Lachnospiraceae bacterium COE1 | 1.068315 | 0.69781 | −2.21009
OTU_361 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Lachnospiraceae UCG-006 | D_6__Clostridium sp. ASF502 | −0.30308 | −0.63509 | −0.45926
OTU_366 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Marvinbryantia | D_6__Clostridiales bacterium CIEAF 012 | −0.03092 | 0.015579 | −0.63559
OTU_369 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Roseburia | D_6__Clostridium sp. Clone-44 | 0.248954 | −0.02 | −0.30503
OTU_370 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Roseburia | D_6__Eubacterium sp. 14-2 | −1.37716 | −2.75447 | 1.538724
OTU_395 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Ruminococcaceae | D_5__Anaerotruncus | D_6__Anaerotruncus sp. G3(2012) | 0.133685 | −1.46779 | 0.071386
OTU_408 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Ruminococcaceae | D_5__Ruminiclostridium 5 | D_6__Clostridium sp. Culture Jar-8 | 0.000463 | 4.359209 | −4.29712
OTU_409 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Ruminococcaceae | D_5__Ruminiclostridium 5 | D_6__Ruminiclostridium sp. KB18 | 0.564388 | 2.346951 | −1.2909
OTU_412 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Ruminococcaceae | D_5__Ruminiclostridium 9 | D_6__bacterium enrichment culture clone M244 | −0.34294 | 1.211262 | −1.14918
OTU_516 | D_0__Bacteria | D_1__Proteobacteria | D_2__Alphaproteobacteria | D_3__Rickettsiales | D_4__Mitochondria | D_5__Triticum aestivum (bread wheat) | D_6__Triticum aestivum (bread wheat) | −0.13817 | −0.15863 | 0.151402

> tab_coef = rest_primary$beta
> options(digits=2)
> tax_tab= physeq@tax_table[rownames(tab_coef), ]
> options(max.print=999999)
> coef_rest <- as.data.frame(cbind(tax_tab,tab_coef))
> # Table 14.4
> # Write results table
> # Make the table
> library(xtable)
> table1 <- xtable(coef_rest,caption = "Table of Coefficients from the Primary Results",label="Coef_taxa_table")
> print.xtable(table1,type="html",file = "ANCOMBC_Table_Coef_QtRNA.html")
> write.csv(coef_rest ,file = paste("ANCOMBC_Table_Coef_QtRNA.csv",sep = ""))
The above table is extracted from the file ANCOMBC_Table_Coef_QtRNA.csv.
  • Step 2: Obtain the estimated standard errors of beta.

> tab_se = rest_primary$se
> options(max.print=999999)
> se_rest <- as.data.frame(cbind(tax_tab,tab_se))
> library(xtable)
> table2 <- xtable(se_rest,caption = "Table of Standard Errors from the Primary Results",label="SE_taxa_table")
> print.xtable(table2,type="html",file = "ANCOMBC_Table_SE_QtRNA.html")
> write.csv(se_rest,file = paste("ANCOMBC_Table_SE_QtRNA.csv",sep = ""))
  • Step 3: Obtain the test statistics.

> tab_w = rest_primary$W
> options(max.print=999999)
> w_rest <- as.data.frame(cbind(tax_tab,tab_w))
> library(xtable)
> table3 <- xtable(w_rest,caption = "Table of Test Statistics from the Primary Results",label="Statistics_taxa_table")
> print.xtable(table3,type="html",file = "ANCOMBC_Table_Statistics_QtRNA.html")
> write.csv(w_rest,file = paste("ANCOMBC_Table_Statistics_QtRNA.csv",sep = ""))
  • Step 4: Obtain the P-values from the two-sided Z-test using the test statistics.

> tab_p = rest_primary$p_val
> options(max.print=999999)
> p_rest <- as.data.frame(cbind(tax_tab,tab_p))
> library(xtable)
> table4 <- xtable(p_rest,caption = "Table of P-values from the Primary Results",label="P-values_taxa_table")
> print.xtable(table4,type="html",file = "ANCOMBC_Table_P-values_QtRNA.html")
> write.csv(p_rest,file = paste("ANCOMBC_Table_P-values_QtRNA.csv",sep = ""))
  • Step 5: Obtain the adjusted P-values by applying the method specified in p_adj_method.

As summarized in Part 5 of the ANCOM-BC method above, it was shown (Lim et al. 2013) that the Holm-Bonferroni (Holm 1979) and Bonferroni (Bonferroni 1936; Dunn 1958, 1961) correction methods are more appropriate for controlling the FDR when P-values are not accurate. The Benjamini-Hochberg (BH) procedure controls the FDR if the taxa have either independent structures or some positive correlation structures (Benjamini and Hochberg 1995; Benjamini and Yekutieli 2001). Thus, to control the FDR due to multiple comparisons, ANCOM-BC recommends applying the Holm-Bonferroni or Bonferroni correction method instead of the BH procedure to adjust the raw P-values. It was shown in ANCOM-BC (Lin and Peddada 2020a) that the P-values adjusted by the Bonferroni correction are more conservative, while the BH procedure results in an FDR around the nominal level (5%), maintaining larger power (Table 14.5).
Table 14.5

The adjusted P-values of differentially abundant analysis of taxa in QtRNA microbiome data

OTU | Kingdom | Phylum | Class | Order | Family | Genus | Species | Group | TimePost | Group:TimePost
OTU_56 | D_0__Bacteria | D_1__Actinobacteria | D_2__Coriobacteriia | D_3__Coriobacteriales | D_4__Eggerthellaceae | D_5__Enterorhabdus | D_6__Enterorhabdus caecimuris B7 | 1 | 1 | 1
OTU_57 | D_0__Bacteria | D_1__Actinobacteria | D_2__Coriobacteriia | D_3__Coriobacteriales | D_4__Eggerthellaceae | D_5__Enterorhabdus | D_6__Enterorhabdus muris | 1 | 1 | 1
OTU_58 | D_0__Bacteria | D_1__Actinobacteria | D_2__Coriobacteriia | D_3__Coriobacteriales | D_4__Eggerthellaceae | D_5__Enterorhabdus | D_6__mouse gut metagenome | 1 | 0.012024 | 1
OTU_77 | D_0__Bacteria | D_1__Bacteroidetes | D_2__Bacteroidia | D_3__Bacteroidales | D_4__Bacteroidaceae | D_5__Bacteroides | D_6__Bacteroides acidifaciens JCM 10556 | 0 | 1 | 0
OTU_101 | D_0__Bacteria | D_1__Bacteroidetes | D_2__Bacteroidia | D_3__Bacteroidales | D_4__Muribaculaceae | D_5__Muribaculum | D_6__Parabacteroides sp. YL27 | 1 | 1 | 0.409359
OTU_131 | D_0__Bacteria | D_1__Bacteroidetes | D_2__Bacteroidia | D_3__Bacteroidales | D_4__Tannerellaceae | D_5__Parabacteroides | D_6__Parabacteroides goldsteinii CL02T12C30 | 1 | 0.522427 | 1
OTU_231 | D_0__Bacteria | D_1__Deferribacteres | D_2__Deferribacteres | D_3__Deferribacterales | D_4__Deferribacteraceae | D_5__Mucispirillum | D_6__Mucispirillum schaedleri ASF457 | 1 | 1 | 1
OTU_263 | D_0__Bacteria | D_1__Firmicutes | D_2__Bacilli | D_3__Bacillales | D_4__Staphylococcaceae | D_5__Staphylococcus | D_6__Staphylococcus saprophyticus subsp. saprophyticus | 1 | 0.54287 | 1
OTU_264 | D_0__Bacteria | D_1__Firmicutes | D_2__Bacilli | D_3__Bacillales | D_4__Staphylococcaceae | D_5__Staphylococcus | D_6__Staphylococcus sp. UAsDu23 | 1 | 4.80E-06 | 1
OTU_276 | D_0__Bacteria | D_1__Firmicutes | D_2__Bacilli | D_3__Lactobacillales | D_4__Lactobacillaceae | D_5__Lactobacillus | D_6__Lactobacillus murinus | 0 | 1 | 0
OTU_279 | D_0__Bacteria | D_1__Firmicutes | D_2__Bacilli | D_3__Lactobacillales | D_4__Lactobacillaceae | D_5__Lactobacillus | D_6__gut metagenome | 1 | 1 | 1
OTU_287 | D_0__Bacteria | D_1__Firmicutes | D_2__Bacilli | D_3__Lactobacillales | D_4__Streptococcaceae | D_5__Streptococcus | D_6__Streptococcus danieliae | 1 | 1 | 1
OTU_323 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__ASF356 | D_6__Clostridium sp. ASF356 | 0 | 1 | 0
OTU_325 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Acetatifactor | D_6__Clostridiales bacterium CIEAF 015 | 1 | 0.008759 | 1
OTU_330 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Anaerostipes | D_6__Clostridiales bacterium VE202-09 | 1 | 1 | 1
OTU_351 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Lachnospiraceae NK4A136 group | D_6__Clostridiales bacterium CIEAF 020 | 1 | 1 | 1
OTU_353 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Lachnospiraceae NK4A136 group | D_6__Lachnospiraceae bacterium A4 | 1 | 1 | 1
OTU_354 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Lachnospiraceae NK4A136 group | D_6__Lachnospiraceae bacterium COE1 | 1 | 1 | 0.169092
OTU_361 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Lachnospiraceae UCG-006 | D_6__Clostridium sp. ASF502 | 1 | 1 | 1
OTU_366 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Marvinbryantia | D_6__Clostridiales bacterium CIEAF 012 | 1 | 1 | 1
OTU_369 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Roseburia | D_6__Clostridium sp. Clone-44 | 0 | 1 | 0
OTU_370 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Roseburia | D_6__Eubacterium sp. 14-2 | 1 | 0.00047 | 1
OTU_395 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Ruminococcaceae | D_5__Anaerotruncus | D_6__Anaerotruncus sp. G3(2012) | 1 | 0.000105 | 1
OTU_408 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Ruminococcaceae | D_5__Ruminiclostridium 5 | D_6__Clostridium sp. Culture Jar-8 | 0 | 2.38E-47 | 0
OTU_409 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Ruminococcaceae | D_5__Ruminiclostridium 5 | D_6__Ruminiclostridium sp. KB18 | 1 | 0.636737 | 1
OTU_412 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Ruminococcaceae | D_5__Ruminiclostridium 9 | D_6__bacterium enrichment culture clone M244 | 0 | 1 | 0
OTU_516 | D_0__Bacteria | D_1__Proteobacteria | D_2__Alphaproteobacteria | D_3__Rickettsiales | D_4__Mitochondria | D_5__Triticum aestivum (bread wheat) | D_6__Triticum aestivum (bread wheat) | 0 | 1 | 0

> tab_q = rest_primary$q
> options(max.print=999999)
> q_rest <- as.data.frame(cbind(tax_tab,tab_q))
> # Table 14.5
> # Write results table
> # Make the table
> library(xtable)
> table5 <- xtable(q_rest, caption = "Table of Adjusted p-values from the Primary Results",label="Adjusted p-values_taxa_table")
> print.xtable(table5,type="html",file = "ANCOMBC_Table_Adjusted p-values_QtRNA.html")
> write.csv(q_rest,file = paste("ANCOMBC_Table_Adjusted p-values_QtRNA.csv",sep = ""))
  • Step 6: Identify the differentially abundant taxa (Table 14.6).

Table 14.6

The identified differentially abundant taxa in QtRNA microbiome data

OTU | Kingdom | Phylum | Class | Order | Family | Genus | Species | Group | TimePost | Group:TimePost
OTU_56 | D_0__Bacteria | D_1__Actinobacteria | D_2__Coriobacteriia | D_3__Coriobacteriales | D_4__Eggerthellaceae | D_5__Enterorhabdus | D_6__Enterorhabdus caecimuris B7 | FALSE | FALSE | FALSE
OTU_57 | D_0__Bacteria | D_1__Actinobacteria | D_2__Coriobacteriia | D_3__Coriobacteriales | D_4__Eggerthellaceae | D_5__Enterorhabdus | D_6__Enterorhabdus muris | FALSE | FALSE | FALSE
OTU_58 | D_0__Bacteria | D_1__Actinobacteria | D_2__Coriobacteriia | D_3__Coriobacteriales | D_4__Eggerthellaceae | D_5__Enterorhabdus | D_6__mouse gut metagenome | FALSE | TRUE | FALSE
OTU_77 | D_0__Bacteria | D_1__Bacteroidetes | D_2__Bacteroidia | D_3__Bacteroidales | D_4__Bacteroidaceae | D_5__Bacteroides | D_6__Bacteroides acidifaciens JCM 10556 | TRUE | FALSE | TRUE
OTU_101 | D_0__Bacteria | D_1__Bacteroidetes | D_2__Bacteroidia | D_3__Bacteroidales | D_4__Muribaculaceae | D_5__Muribaculum | D_6__Parabacteroides sp. YL27 | FALSE | FALSE | FALSE
OTU_131 | D_0__Bacteria | D_1__Bacteroidetes | D_2__Bacteroidia | D_3__Bacteroidales | D_4__Tannerellaceae | D_5__Parabacteroides | D_6__Parabacteroides goldsteinii CL02T12C30 | FALSE | FALSE | FALSE
OTU_231 | D_0__Bacteria | D_1__Deferribacteres | D_2__Deferribacteres | D_3__Deferribacterales | D_4__Deferribacteraceae | D_5__Mucispirillum | D_6__Mucispirillum schaedleri ASF457 | FALSE | FALSE | FALSE
OTU_263 | D_0__Bacteria | D_1__Firmicutes | D_2__Bacilli | D_3__Bacillales | D_4__Staphylococcaceae | D_5__Staphylococcus | D_6__Staphylococcus saprophyticus subsp. saprophyticus | FALSE | FALSE | FALSE
OTU_264 | D_0__Bacteria | D_1__Firmicutes | D_2__Bacilli | D_3__Bacillales | D_4__Staphylococcaceae | D_5__Staphylococcus | D_6__Staphylococcus sp. UAsDu23 | FALSE | TRUE | FALSE
OTU_276 | D_0__Bacteria | D_1__Firmicutes | D_2__Bacilli | D_3__Lactobacillales | D_4__Lactobacillaceae | D_5__Lactobacillus | D_6__Lactobacillus murinus | TRUE | FALSE | TRUE
OTU_279 | D_0__Bacteria | D_1__Firmicutes | D_2__Bacilli | D_3__Lactobacillales | D_4__Lactobacillaceae | D_5__Lactobacillus | D_6__gut metagenome | FALSE | FALSE | FALSE
OTU_287 | D_0__Bacteria | D_1__Firmicutes | D_2__Bacilli | D_3__Lactobacillales | D_4__Streptococcaceae | D_5__Streptococcus | D_6__Streptococcus danieliae | FALSE | FALSE | FALSE
OTU_323 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__ASF356 | D_6__Clostridium sp. ASF356 | TRUE | FALSE | TRUE
OTU_325 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Acetatifactor | D_6__Clostridiales bacterium CIEAF 015 | FALSE | TRUE | FALSE
OTU_330 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Anaerostipes | D_6__Clostridiales bacterium VE202-09 | FALSE | FALSE | FALSE
OTU_351 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Lachnospiraceae NK4A136 group | D_6__Clostridiales bacterium CIEAF 020 | FALSE | FALSE | FALSE
OTU_353 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Lachnospiraceae NK4A136 group | D_6__Lachnospiraceae bacterium A4 | FALSE | FALSE | FALSE
OTU_354 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Lachnospiraceae NK4A136 group | D_6__Lachnospiraceae bacterium COE1 | FALSE | FALSE | FALSE
OTU_361 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Lachnospiraceae UCG-006 | D_6__Clostridium sp. ASF502 | FALSE | FALSE | FALSE
OTU_366 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Marvinbryantia | D_6__Clostridiales bacterium CIEAF 012 | FALSE | FALSE | FALSE
OTU_369 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Roseburia | D_6__Clostridium sp. Clone-44 | TRUE | FALSE | TRUE
OTU_370 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Lachnospiraceae | D_5__Roseburia | D_6__Eubacterium sp. 14-2 | FALSE | TRUE | FALSE
OTU_395 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Ruminococcaceae | D_5__Anaerotruncus | D_6__Anaerotruncus sp. G3(2012) | FALSE | TRUE | FALSE
OTU_408 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Ruminococcaceae | D_5__Ruminiclostridium 5 | D_6__Clostridium sp. Culture Jar-8 | TRUE | TRUE | TRUE
OTU_409 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Ruminococcaceae | D_5__Ruminiclostridium 5 | D_6__Ruminiclostridium sp. KB18 | FALSE | FALSE | FALSE
OTU_412 | D_0__Bacteria | D_1__Firmicutes | D_2__Clostridia | D_3__Clostridiales | D_4__Ruminococcaceae | D_5__Ruminiclostridium 9 | D_6__bacterium enrichment culture clone M244 | TRUE | FALSE | TRUE
OTU_516 | D_0__Bacteria | D_1__Proteobacteria | D_2__Alphaproteobacteria | D_3__Rickettsiales | D_4__Mitochondria | D_5__Triticum aestivum (bread wheat) | D_6__Triticum aestivum (bread wheat) | TRUE | FALSE | TRUE

> tab_diff = rest_primary$diff_abn
> options(max.print=999999)
> diff_rest <- as.data.frame(cbind(tax_tab,tab_diff))
> # Table 14.6
> # Write results table
> # Make the table
> library(xtable)
> table6 <- xtable(diff_rest, caption = "Table of Differentially Abundant Taxa from the Primary Results",label="DA_taxa_table")
> print.xtable(table6,type="html",file = "ANCOMBC_Table_Differentially Abundant Taxa_QtRNA.html")
> write.csv(diff_rest,file = paste("ANCOMBC_Table_Differentially Abundant Taxa_QtRNA.csv",sep = ""))
The table above is extracted from the ANCOMBC_Table_Differentially Abundant Taxa_QtRNA output.
  • Step 7: Estimate and adjust the sample-specific sampling fractions to obtain the bias-adjusted abundances.

First, estimate the sample-specific sampling fractions on the natural log scale. Because the sampling fraction is not estimable for a sample with missing values, the missing values (NA) for any variable specified in the formula are replaced with 0.

> samp_frac = output$samp_frac
> # Replace NA with 0
> samp_frac[is.na(samp_frac)] = 0

Then, adjust the log observed abundances by subtracting the estimated sampling fraction from the log observed abundances of each sample. Note that to avoid taking the log of 0, which is undefined, a pseudocount (here 1) is added to each observed abundance before taking the natural log.

> # Add pseudocount (1) to avoid taking the log of 0
> library(microbiome)
> log_obs_abn = log(abundances(physeq) + 1)
> # Adjust the log observed abundances
> log_obs_abn_adj = t(t(log_obs_abn) - samp_frac)
> library(xtable)
> table7 <- xtable(round(log_obs_abn_adj, 2), caption = "Table of Bias-adjusted Log Observed Abundances from the Primary Results",label="Bias-adjusted Log-OA_taxa_table")
> print.xtable(table7,type="html",file = "ANCOMBC_Table_Bias-adjusted Log Observed Abundances_QtRNA.html")
> write.csv(round(log_obs_abn_adj, 2),file = paste("ANCOMBC_Table_Bias-adjusted Log Observed Abundances_QtRNA.csv",sep = ""))
  • ANCOMBC Global Test.

In the sample metadata, we created a four-level categorical variable “Group4” (WT_BEFORE, WT_POST, KO_BEFORE, and KO_POST). We used the following commands to illustrate the ANCOM-BC global test with this variable.

> library(ANCOMBC)
> output = ancombc(phyloseq = physeq, formula = "Group4",
+ p_adj_method = "holm", zero_cut = 0.90, lib_cut = 1000,
+ group = "Group4", struc_zero = TRUE, neg_lb = TRUE, tol = 1e-5,
+ max_iter = 100, conserve = TRUE, alpha = 0.05,global = TRUE)
> rest_global = output$res_global

The global test results are contained in the output$res_global (a data.frame for the variable specified in group).

Each column contains the test statistics (W), the P-values (p_val) from the two-sided chi-square test using W, the adjusted P-values (q_val) obtained by applying the p_adj_method to p_val, and the logical vector (diff_abn) of differentially abundant taxa, with TRUE if the taxon has q_val less than alpha and FALSE otherwise. As in the primary analysis, we can obtain the ANCOMBC global test results using the following steps:
  • Step 1: Obtain the test global statistics.

> tab_w = rest_global$W
> options(max.print=999999)
> w_rest <- as.data.frame(cbind(tax_tab,tab_w))
> library(xtable)
> table8 <- xtable(w_rest,caption = "Table of Global Test Statistics from the Global Results",label="Global_Statistics_taxa_table")
> print.xtable(table8,type="html",file = "ANCOMBC_Table_Global_Statistics_QtRNA.html")
> write.csv(w_rest,file = paste("ANCOMBC_Table_Global_Statistics_QtRNA.csv",sep = ""))
  • Step 2: Obtain the P-values from the two-sided chi-square test using the global test statistics.

> tab_p = rest_global$p_val
> options(max.print=999999)
> p_rest <- as.data.frame(cbind(tax_tab,tab_p))
> library(xtable)
> table9 <- xtable(p_rest,caption = "Table of P-values from the Global Results",label="Global_P-values_taxa_table")
> print.xtable(table9,type="html",file = "ANCOMBC_Table_Global_P-values_QtRNA.html")
> write.csv(p_rest,file = paste("ANCOMBC_Table_Global_P-values_QtRNA.csv",sep = ""))
  • Step 3: Adjust the P-values for the global test by applying the method specified in p_adj_method.

> tab_q = rest_global$q_val
> options(max.print=999999)
> q_rest <- as.data.frame(cbind(tax_tab,tab_q))
> library(xtable)
> table10 <- xtable(q_rest, caption = "Table of Adjusted P-values from the Global Results",label="Adjusted_p-values_taxa_table")
> print.xtable(table10,type="html",file = "ANCOMBC_Table_Global_Adjusted p-values_QtRNA.html")
> write.csv(q_rest,file = paste("ANCOMBC_Table_Global_Adjusted p-values_QtRNA.csv",sep = ""))
  • Step 4: Identify the differentially abundant taxa for the global test.

> tab_diff = rest_global$diff_abn
> options(max.print=999999)
> diff_rest <- as.data.frame(cbind(tax_tab,tab_diff))
> library(xtable)
> table11 <- xtable(diff_rest, caption = "Table of Differentially Abundant Taxa from the Global Results",label="Global_DA_taxa_table")
> print.xtable(table11,type="html",file = "ANCOMBC_Table_Global_Differentially Abundant Taxa_QtRNA.html")
> write.csv(diff_rest,file = paste("ANCOMBC_Table_Global_Differentially Abundant Taxa_QtRNA.csv",sep = ""))
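As a convenience, the four steps above can also be collected in a single pass; the following is a minimal sketch that assumes the rest_global and tax_tab objects created above (the output file name is illustrative):

```r
# Combine the global test statistics, P-values, adjusted P-values,
# and differential-abundance calls into one results data frame
global_summary <- as.data.frame(cbind(tax_tab,
                                      W        = rest_global$W,
                                      p_val    = rest_global$p_val,
                                      q_val    = rest_global$q_val,
                                      diff_abn = rest_global$diff_abn))
# Keep only the taxa flagged by the global test (diff_abn is TRUE when q_val < alpha)
da_taxa <- global_summary[global_summary$diff_abn == TRUE, ]
write.csv(global_summary, file = "ANCOMBC_Table_Global_Summary_QtRNA.csv")
```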

14.4.3 Remarks on ANCOM-BC

  • The ANCOM-BC method has some unique characteristics (Lin and Peddada 2020a): (1) it explicitly tests hypotheses of differential absolute abundance for individual taxa and provides valid P-values and confidence intervals for each taxon; (2) it provides an approach to estimate the sampling fraction and performs DA analysis by correcting the bias due to (unobservable) differential sampling fractions across samples; and (3) it does not rely on strong parametric assumptions. ANCOM-BC also has some strengths (Lin and Peddada 2020a):
    • It can construct confidence intervals for the DA of each taxon, whereas ANCOM and other DA methods cannot. For example, we can construct 95% simultaneous confidence intervals using the Bonferroni method for the mean DA of each taxon in the two experimental groups; hence an effect size associated with each taxon can be obtained when comparing two experimental groups.

    • It was demonstrated that, like ANCOM, ANCOM-BC controls the FDR well while maintaining adequate power compared to other often-used DA methods, including the count-based methods edgeR and DESeq2 from RNA-seq studies and the zero-inflated Gaussian (ZIG) mixture model used in metagenomeSeq (Lin and Peddada 2020a).

  • ANCOM-BC also has some weaknesses, for example:
    • ANCOM-BC cannot control the FDR well for small sample sizes (e.g., n = 5 per group); ten samples are required to control the FDR with adequate power (Lin and Peddada 2020a). This is probably because the estimator and the test statistic of ANCOM-BC are derived asymptotically.

    • ANCOM-BC assumes that the compositional data are linear in the covariates. Like ANCOM, ANCOM-BC cannot control the FDR well when this linearity assumption is violated (Wang 2021).

    • ANCOM-BC adopted the methodology of Kaul et al. (2017) for dealing with structural zeros, outlier zeros, and sampling zeros. In practice, ANCOM-BC actually drops taxa from the analysis when the structural zeros associated with these taxa are detected by the criteria of Kaul et al. (2017).

    • Because ANCOM-BC uses a log transformation of the OTU count data rather than a log-ratio transformation, the ANCOM-BC methodology is not a typical CoDa method in the sense of Aitchison’s approach (Aitchison 1986a). To solve the problem of compositionality, ANCOM-BC uses the sum of the ecosystem as a reference to offset the data. This kind of normalization is not the typical Aitchison approach; rather, it is like using an offset in the count-based methods.

ANCOM and ANCOM-BC share some common features: (1) both assume that the observed sample is an unknown fraction of a unit volume of the ecosystem and that the sampling fraction varies from sample to sample, and (2) both use a log-ratio transformation (or log scale) to serve as normalization.

However, ANCOM-BC is different from ANCOM in some ways:
  • Unlike ANCOM, ANCOM-BC does not directly implement a log-ratio transformation; instead, it simplifies DA analysis by recasting the problem as a linear regression with an offset that accounts for the sampling fraction. That is, ANCOM-BC’s bias correction is a sample-specific offset term introduced into a linear regression; this offset, in conjunction with performing the regression on the log scale, serves the same purpose as a log-ratio transformation.

  • Compared to ANCOM, ANCOM-BC is very fast because it avoids the repeated use of each taxon as a reference that ANCOM requires. ANCOM needs $$ \frac{m\left(m-1\right)}{2} $$ linear regressions to fit its models, whereas ANCOM-BC requires only m.
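The simultaneous confidence intervals noted among ANCOM-BC’s strengths above can be sketched from the primary output. This is a hedged illustration that assumes res$beta and res$se hold the bias-corrected coefficient estimates and their standard errors (as in the package version used in this chapter) and that the first column is the contrast of interest:

```r
res  <- output$res          # primary ANCOM-BC results
beta <- res$beta            # bias-corrected log-scale coefficients
se   <- res$se              # standard errors of the coefficients
m    <- nrow(beta)          # number of taxa tested

# Bonferroni-adjusted critical value for 95% simultaneous coverage
z_crit <- qnorm(1 - 0.05 / (2 * m))

# Simultaneous confidence interval for each taxon (first contrast column)
ci <- data.frame(beta  = beta[, 1],
                 lower = beta[, 1] - z_crit * se[, 1],
                 upper = beta[, 1] + z_crit * se[, 1])
```

Taxa whose interval excludes 0 correspond to effect sizes that remain distinguishable between the two groups after the multiplicity adjustment.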

14.5 Remarks on Compositional Data Analysis Approach

In microbiome studies, two kinds of statistical methods are available: the count-based approach and the compositional data analysis (CoDa) approach. The count-based approach, i.e., negative binomial methods (e.g., edgeR and DESeq2) and over-dispersed and zero-inflated models (e.g., metagenomeSeq), does not transform the data toward normality before analysis; instead it uses an offset to scale the data during modeling.

Different from the count-based approach, the CoDa approach, including ALDEx2, ANCOM, and ANCOM-BC, typically needs some kind of log or log-ratio transformation before analysis.

In some instances, the log-ratio transformation is technically equivalent to a normalization; for example, alr- and clr-transformations can formally serve as effective library size normalizations. Thus the log-ratio transformation is sometimes called log-ratio “normalization” (Quinn et al. 2018a, b). The parametric methods in the CoDa approach (the ALDEx2 t-test, ANCOM-BC) further assume a Gaussian distribution of the data in their statistical models. Because it is difficult to measure the total microbial load or the absolute number of microorganisms to accurately determine taxonomic shifts (Morton et al. 2019), there remains no gold standard for measuring microbiome data, and DA analysis is still controversial in microbiome research. One current effort tries to establish microbial composition measurement standards by defining the concept of “reference frames” for inferring changes in abundance in CoDa. A reference frame is used as a constraint for inferring how microbial populations change relative to other microbial populations; the reference frame is determined by the denominator of a log-ratio (Morton et al. 2019).
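The sense in which a clr-transformation can formally serve as a library size normalization is easy to verify numerically; here is a small base-R sketch with hypothetical counts:

```r
# clr: log of each component relative to the geometric mean of the sample
clr <- function(x) log(x) - mean(log(x))

counts <- c(taxonA = 20, taxonB = 50, taxonC = 130)

# A 10-fold deeper library (same composition) gives identical clr values,
# so differences in library size cancel out of the transformed data
all.equal(clr(counts), clr(10 * counts))   # TRUE
```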

CoDa methods have their advantages: they provide a statistical tool to move from the simplex into real space (Aitchison et al. 2000) and hence enable statistical methods based on Euclidean distances. Thus, log-ratio “normalizations” or log-ratio transformations are useful even when they do not normalize the data. As we reviewed in this chapter, the ALDEx2, ANCOM, and ANCOM-BC methods have been evaluated and compared to edgeR, DESeq2, metagenomeSeq, and other competing methods, and their robustness and advantages are claimed by their proposing authors.

However, we notice that when ANCOM-BC claims that both ANCOM and ANCOM-BC outperform edgeR, DESeq2, and ZIG in terms of FDR control and higher power, their DA analyses are conceptually different. For example, ANCOM-BC analyzes differential absolute abundance rather than differential relative abundance. When ANCOM-BC states that it takes into account the compositionality of microbiome data, it actually uses a log transformation of the OTU data and an offset to account for the sampling fraction. Additionally, both edgeR and DESeq2 were developed based on NB models to account for small sample sizes and the individual variation of genes (taxa). Overall, the concept and approach of reference frames remain within CoDa with log-ratio transformation; thus the challenges remain of dealing with the zero problem and, in practice, of picking an appropriate reference frame and understanding the statistical properties of these reference frames (Morton et al. 2019). We should consider this context when we interpret the performance of model comparisons.

On the one hand, whether or not microbiome data are really compositional is debatable. Some researchers acknowledge that next-generation sequencing (NGS) abundance data differ slightly from formally defined compositional data because they contain integer values only; for instance, NGS abundance data do not require the arbitrary sum that represents complete unity in Aitchison’s definition of compositional data (Aitchison 1982, 1986a). Thus, many datasets, possibly including NGS abundance data, lack information on potential components and hence exist as “incomplete compositions” (Quinn et al. 2018a, b). But because this kind of data (e.g., NGS abundance data) is incidentally constrained to an arbitrary sum, it still has compositional properties and can be called “count compositional data” and treated as compositional data (Lovell et al. 2015; Quinn et al. 2017, 2018a, b). On the other hand, other researchers do not think that count data are purely relative: the count pair (1, 2) carries different information than the pair (1000, 2000), even though the ratio of the two pairs is the same. Instead, they think that treating count data as compositional is treating “really discrete count data” as continuous (Bacon-Shone 2008).

Briefly, there are five reasons that drive these researchers to take a count-based approach to analyze microbiome data and other NGS data instead of using compositional data analysis (see details in Xia et al. 2018a; Xia 2020):
  1. (1)

    Compositionality (e.g., the spurious correlation concern) occurs when the data dimension is relatively low (i.e., the number of components is relatively small), such as in ecology studies. However, microbiome data are high dimensional, usually with a large number of taxa. Even if there is a compositional effect (e.g., spurious correlation), its effect is mild when the samples have large diversity (Zhang et al. 2017; Friedman and Alm 2012) because, as the number of taxa increases, the compositional effect attenuates (Hu et al. 2018).

     
  2. (2)

    The mild data compositionality can be corrected by technological developments including the estimation of absolute cellular abundances from microbiome sequence data in microbiome data science (Vandeputte et al. 2017).

     
  3. (3)

    The count-based models do not require the data to be transformed before analysis, and it has been shown that these models have more statistical power to detect differential expression (Robinson and Oshlack 2010; Zhang and Yi 2020). Thus, these researchers advise treating high-throughput sequencing datasets as count data (Anders et al. 2013; Kuczynski et al. 2011).

     
  4. (4)

    Compositional data analysis may lack strong biological interpretation. The results from transformation-based analyses should be interpreted with respect to the chosen reference, or translated back into compositional terms, which remains complicated. Otherwise the results are difficult to interpret biologically, because one purpose of a microbiome study is to detect which taxa are associated with the outcome (Hu et al. 2018) rather than the association between ratios of taxa and the outcome. Thus, CoDa’s use of taxa ratios to interpret results has been criticized for lacking direct clinical relevance (Xia 2020).

     
  5. (5)

    Compositional data analysis usually applies a log-ratio transformation to the abundance count data before analysis. To ensure that the log of zero is defined, an arbitrary small constant is added to zero values. How best to handle zeros remains a topic of ongoing research and poses a major challenge for CoDa (Xia 2020).

     

Both compositional and count-based models are based on abundance read counts. However, functional studies indicate that rare taxa with lower read counts do not necessarily have lesser functions related to health and disease. Thus, statistical models that are sensitive enough to detect these rare taxa are especially important.
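The zero problem raised in reason (5) above can be made concrete with a short base-R sketch (the pseudocount 0.5 is an arbitrary illustrative choice, not a recommendation):

```r
counts <- c(0, 5, 12, 0, 83)

# log(0) is -Inf, so a small constant is added before the log-ratio
# transformation; the choice of constant is arbitrary and can affect
# downstream results, which is why zero handling remains an open issue
pseudo   <- counts + 0.5
clr_vals <- log(pseudo) - mean(log(pseudo))
```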

14.6 Summary

This chapter focused on the principles, methods, and techniques of compositional data analysis of microbiome data. First, we described overall compositional data analysis (CoDa), including the definition of compositional data, the reasons that some researchers treat microbiome data as compositional, the definition of the Aitchison simplex, the challenges of analyzing compositional data, fundamental principles of CoDa, and the family of log-ratio transformations: additive log-ratio (alr), centered log-ratio (clr), isometric log-ratio (ilr), and inter-quartile log-ratio (iqlr). Next, we introduced and illustrated three CoDa methods, in the order in which they were proposed: (1) ANOVA-Like Compositional Differential Abundance Analysis (ALDEx2), (2) Analysis of Composition of Microbiomes (ANCOM), and (3) Analysis of Composition of Microbiomes-Bias Correction (ANCOM-BC). We implemented ALDEx2 and ANCOM-BC in R and ANCOM using QIIME 2. We also provided some general remarks on the CoDa approach.

In Chap. 15, we will introduce how to analyze longitudinal microbiome data through linear mixed-effects models.